@bianca_unifeye, you have covered most of them.

I would like to add a few:

For the data format, I would recommend Delta.

Below are some areas for improvement when using Delta:

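1. Enable deletion vectors for improved performance (deletes and updates mark rows as removed instead of rewriting whole data files).

2. Enable liquid clustering (improves query performance and keeps the data layout balanced without manual partition tuning).

3. Set the Delta checkpoint interval (a checkpoint file is created every N commits, which makes reading the table state faster).

4. Enable Delta auto compaction and optimized writes (auto optimize).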
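To make the four points above concrete, here is a minimal sketch that builds the SQL statements for them. The property names are the Delta table properties as I know them; the helper name `delta_tuning_statements`, the table name and the clustering columns are hypothetical, and availability of each feature depends on your Databricks Runtime / Delta Lake version, so please verify against your environment.

```python
# Sketch only: values are illustrative, not recommendations.
DELTA_PROPERTIES = {
    "delta.enableDeletionVectors": "true",       # point 1: deletion vectors
    "delta.checkpointInterval": "10",            # point 3: checkpoint every 10 commits
    "delta.autoOptimize.optimizeWrite": "true",  # point 4: optimized writes
    "delta.autoOptimize.autoCompact": "true",    # point 4: auto compaction
}


def delta_tuning_statements(table, cluster_columns, properties=DELTA_PROPERTIES):
    """Return the ALTER TABLE statements for the settings listed above.

    `table` and `cluster_columns` are supplied by the caller (hypothetical here).
    """
    props = ", ".join(f"'{key}' = '{value}'" for key, value in properties.items())
    return [
        f"ALTER TABLE {table} SET TBLPROPERTIES ({props})",
        # point 2: liquid clustering on the chosen columns
        f"ALTER TABLE {table} CLUSTER BY ({', '.join(cluster_columns)})",
    ]


# Usage, assuming an active SparkSession named `spark`:
# for stmt in delta_tuning_statements("sales.orders", ["order_date"]):
#     spark.sql(stmt)
```

Note that liquid clustering is meant as a replacement for partitioning, so the `CLUSTER BY` statement is intended for tables that are not partitioned.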

With respect to Spark:

1. Enable AQE (https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution). Depending on your requirements, enable the relevant configuration parameters under AQE; it handles most of the common performance issues.

2. Use DataFrames or SQL, since both go through the Catalyst optimizer, and with Photon enabled they give the best performance (Photon comes at an extra cost, so decide which matters more to you: time or money).
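For the AQE point, a small sketch of how the settings could be applied on a session. The three keys are standard Spark SQL configuration names from the linked page; which ones you actually need depends on your workload, and the helper name `apply_aqe_settings` is just something I made up for illustration.

```python
# Sketch only: pick the AQE options that match your workload.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",                     # turn AQE on
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge small shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed join partitions
}


def apply_aqe_settings(spark, settings=AQE_SETTINGS):
    """Set each AQE option on the session's runtime config and return what was applied."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return dict(settings)


# Usage, assuming an active SparkSession named `spark`:
# apply_aqe_settings(spark)
```

On recent Spark versions AQE is already enabled by default, so this is mainly useful for making the choice explicit or for toggling the individual options.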
