Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
04-20-2023 04:22 PM
I have a bunch of data frames from different data sources. They are all time series data in order of a column timestamp, which is an int32 Unix timestamp. I can join them together by this and another column join_idx which is basically an integer index ordered by timestamp. When I join these data frames together, the resulting data frame is now out of order, which results in extremely slow RANGE BETWEEN queries. I need to be able to bucket the times into non-rolling windows as part of point in range queries. Is there any way that I can use Databricks' localization based on timestamps to keep the data collected?
df_all = (
df_1.join(df_2, on=['timestamp', 'join_idx'], how="inner")
.join(df_3, on=['timestamp', 'join_idx'], how="inner")
.join(df_4, on=['timestamp', 'join_idx'], how="inner")
.withColumn("id", lit(id))
).writeTo("joined_table").createOrReplace()
Labels:
- Labels:
-
Timeseries