What is the best practice for accelerating queries which looks like the following?
win = Window.partitionBy('key1','key2').orderBy('timestamp')
df.select('timestamp', (F.col('col1') - F.lag('col1').over(win)).alias('col1_diff'))
I have tried to use OPTIMIZE table ZORDER key1, key2 but it does not work well.
There is a lot of shuffling going on. The data is partitioned by date which is a generated column from timestamp.