Delta table acceleration for group by on key colum...

chanansh · ‎02-03-2023

What is the best practice for accelerating queries which looks like the following?

win = Window.partitionBy('key1','key2').orderBy('timestamp') 
df.select('timestamp', (F.col('col1') - F.lag('col1').over(win)).alias('col1_diff'))

I have tried to use OPTIMIZE table ZORDER key1, key2 but it does not work well.

There is a lot of shuffling going on. The data is partitioned by date which is a generated column from timestamp.

Delta table acceleration for group by on key columns using ZORDER does not work