11-24-2025 08:21 PM - edited 11-24-2025 08:35 PM
@bianca_unifeye you have covered most of them.
I would like to add a few.
For the data format, I would recommend Delta.
Below are some areas to tune when using Delta:
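1. Enable deletion vectors for improved performance (deletes and updates mark rows as removed instead of rewriting whole data files).
2. Enable liquid clustering (improves performance and avoids badly sized partitions).
3. Set the Delta checkpoint interval (a checkpoint file is created at the configured interval of commits, which makes reading the transaction log faster).
4. Enable Delta auto compaction and auto optimize (optimized writes).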
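As a rough sketch of how the four Delta options above could be applied, the snippet below only builds the SQL statements as strings, so it runs without a cluster. The table name `sales.events`, the clustering columns, and the checkpoint interval of 20 are made-up placeholders, and the helper functions are hypothetical. The property names are the ones I know from the Delta and Databricks docs, but please check them against your runtime version.

```python
# Table properties matching items 1, 3 and 4 of the list above.
DELTA_TABLE_PROPERTIES = {
    "delta.enableDeletionVectors": "true",
    "delta.checkpointInterval": "20",  # illustrative value, tune per workload
    "delta.autoOptimize.optimizeWrite": "true",
    "delta.autoOptimize.autoCompact": "true",
}


def set_properties_sql(table, properties):
    """Build an ALTER TABLE ... SET TBLPROPERTIES statement (hypothetical helper)."""
    pairs = ", ".join(f"'{key}' = '{value}'" for key, value in properties.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({pairs})"


def cluster_by_sql(table, columns):
    """Build the statement that turns on liquid clustering (item 2)."""
    return f"ALTER TABLE {table} CLUSTER BY ({', '.join(columns)})"


# Placeholder table and column names, replace with your own.
statements = [
    set_properties_sql("sales.events", DELTA_TABLE_PROPERTIES),
    cluster_by_sql("sales.events", ["event_date", "customer_id"]),
]

for statement in statements:
    print(statement)
    # In a notebook you would run: spark.sql(statement)
```

As far as I know, liquid clustering is meant to be used instead of partitioning or ZORDER on a table, so the `CLUSTER BY` statement assumes an unpartitioned Delta table.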
With respect to Spark:
1. Enable AQE (https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution). Based on your requirements, enable the config parameters under AQE; they handle most of the common performance issues.
2. Use DataFrames or SQL, as they go through the Catalyst optimizer, and with Photon enabled this gives the best performance (but Photon comes at a cost, so decide which matters more to you: time or money).
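For the AQE point, here is a minimal sketch of setting the main switches on a session. The three keys are standard Spark SQL configs from the tuning page linked above; the `apply_settings` helper is hypothetical and assumes an existing `spark` session object such as the one a notebook provides. Recent Spark versions (3.2 and later) already have AQE on by default, so this mostly makes the choice explicit.

```python
# Core AQE switches: the master flag, partition coalescing, and skew-join handling.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_settings(spark, settings=AQE_SETTINGS):
    """Set each config on the session and return the values read back.

    `spark` is assumed to be an active SparkSession (hypothetical helper).
    """
    for key, value in settings.items():
        spark.conf.set(key, value)
    return {key: spark.conf.get(key) for key in settings}


# In a notebook or job where `spark` already exists:
# print(apply_settings(spark))
```

Photon is not covered here because it is chosen when configuring the compute, not through a session setting like these.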