Optimizing Spark Read Performance on Delta Tables with Deletion Vectors Enabled
2 weeks ago
Hi Databricks Experts,
I'm currently using Delta Live Table to generate master data managed within Unity Catalog, with the data stored directly in Google Cloud Storage. I then utilize Spark to read these master data from the GCS bucket. However, I’m facing a significant slowdown in Spark processing.
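For context, the read path looks roughly like this. A minimal Spark SQL sketch, assuming a hypothetical table location `gs://my-bucket/master/customers`:

```sql
-- Reading the Delta table directly by its GCS path.
-- Note: path-based reads bypass the Unity Catalog registration; reading
-- via the registered catalog.schema.table name is generally preferred.
SELECT * FROM delta.`gs://my-bucket/master/customers` LIMIT 10;
```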
After some investigation, I suspect that the root cause might be related to the deletion vectors. Here’s what I've tried so far to optimize the tables:
- `OPTIMIZE delta.path FULL;`
- `REORG TABLE delta.path APPLY(PURGE);`
- `VACUUM delta.path RETAIN 2 HOURS;`
Despite these efforts, the performance improvements have been minimal. I'm particularly puzzled as to why the `OPTIMIZE` command doesn’t seem to be having any meaningful effect.
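Before rerunning these commands, it may help to confirm what they actually did. A minimal verification sketch, assuming the table is registered as `main.master.customers` (a hypothetical name; exact operation-metric names can vary by Databricks Runtime version):

```sql
-- Check whether deletion vectors are still enabled on the table.
SHOW TBLPROPERTIES main.master.customers ('delta.enableDeletionVectors');

-- Inspect recent operations; the operationMetrics column shows what
-- OPTIMIZE and REORG ... APPLY (PURGE) actually rewrote or removed.
DESCRIBE HISTORY main.master.customers LIMIT 10;

-- File-level summary (file count, total size) to compare before/after OPTIMIZE.
DESCRIBE DETAIL main.master.customers;
```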
My questions are:
1. **How can I optimize Spark read performance on Delta tables when deletion vectors are enabled?**
2. **Why might the `OPTIMIZE` command not improve performance as expected in this scenario?**
3. **Are there any alternative strategies or best practices to mitigate the performance impact caused by deletion vectors in Delta tables?**
I appreciate any insights or suggestions you might have. Thanks in advance for your help.
Regards,
- Labels: Spark
2 weeks ago
Hi @oliwia823, this question was answered recently in another thread; the link below might help:
https://community.databricks.com/t5/data-engineering/optimizing-spark-read-performance-on-delta-tabl...
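In short, the usual mitigation is to stop producing new deletion vectors and then purge the existing ones. A minimal sketch, assuming a hypothetical Unity Catalog table named `main.master.customers`:

```sql
-- Stop writers from creating new deletion vectors on this table.
-- delta.enableDeletionVectors is a documented Delta table property.
ALTER TABLE main.master.customers
  SET TBLPROPERTIES ('delta.enableDeletionVectors' = false);

-- Rewrite files so existing deletion vectors are materialized (purged).
REORG TABLE main.master.customers APPLY (PURGE);

-- After the retention window passes, remove the now-unreferenced files.
VACUUM main.master.customers;
```

With deletion vectors disabled, deletes and updates rewrite data files eagerly instead of deferring the work to readers, which trades some write performance for more predictable read performance.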

