How to keep data in time-based localized clusters ...

Erik_L · ‎04-20-2023

I have a bunch of data frames from different data sources. They are all time series data in order of a column timestamp, which is an int32 Unix timestamp. I can join them together by this and another column join_idx which is basically an integer index ordered by timestamp. When I join these data frames together, the resulting data frame is now out of order, which results in extremely slow RANGE BETWEEN queries. I need to be able to bucket the times into non-rolling windows as part of point in range queries. Is there any way that I can use Databricks' localization based on timestamps to keep the data collected?

df_all = (
        df_1.join(df_2, on=['timestamp', 'join_idx'], how="inner")
        .join(df_3, on=['timestamp', 'join_idx'], how="inner")
        .join(df_4, on=['timestamp', 'join_idx'], how="inner")
        .withColumn("id", lit(id))
    ).writeTo("joined_table").createOrReplace()

How to keep data in time-based localized clusters after joining?