Optimizing Spark Read Performance on Delta Tables with Deletion Vectors Enabled
2 weeks ago
Hi Databricks Experts,
I'm currently using Delta Live Table to generate master data managed within Unity Catalog, with the data stored directly in Google Cloud Storage. I then utilize Spark to read these master data from the GCS bucket. However, I’m facing a significant slowdown in Spark processing.
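For context, the read path looks roughly like this. A minimal Spark SQL sketch, assuming a hypothetical table location `gs://my-bucket/master/customers`:

```sql
-- Reading the Delta table directly by its GCS path.
-- Note: path-based reads bypass the Unity Catalog registration; reading
-- via the registered catalog.schema.table name is generally preferred.
SELECT * FROM delta.`gs://my-bucket/master/customers` LIMIT 10;
```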
After some investigation, I suspect that the root cause might be related to the deletion vectors. Here’s what I've tried so far to optimize the tables:
- `OPTIMIZE delta.path FULL;`
- `REORG TABLE delta.path APPLY(PURGE);`
- `VACUUM delta.path RETAIN 2 HOURS;`
Despite these efforts, the performance improvements have been minimal. I'm particularly puzzled as to why the `OPTIMIZE` command doesn’t seem to be having any meaningful effect.
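Before rerunning these commands, it may help to confirm what they actually did. A minimal verification sketch, assuming the table is registered as `main.master.customers` (a hypothetical name; exact operation-metric names can vary by Databricks Runtime version):

```sql
-- Check whether deletion vectors are still enabled on the table.
SHOW TBLPROPERTIES main.master.customers ('delta.enableDeletionVectors');

-- Inspect recent operations; the operationMetrics column shows what
-- OPTIMIZE and REORG ... APPLY (PURGE) actually rewrote or removed.
DESCRIBE HISTORY main.master.customers LIMIT 10;

-- File-level summary (file count, total size) to compare before/after OPTIMIZE.
DESCRIBE DETAIL main.master.customers;
```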
My questions are:
1. **How can I optimize Spark read performance on Delta tables when deletion vectors are enabled?**
2. **Why might the `OPTIMIZE` command not improve performance as expected in this scenario?**
3. **Are there any alternative strategies or best practices to mitigate the performance impact caused by deletion vectors in Delta tables?**
I appreciate any insights or suggestions you might have. Thanks in advance for your help.
Regards,
- Labels: Spark
2 weeks ago
Hi @oliwia823, this question was answered recently in another thread; the link below might help:
https://community.databricks.com/t5/data-engineering/optimizing-spark-read-performance-on-delta-tabl...
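In short, the usual mitigation is to stop producing new deletion vectors and then purge the existing ones. A minimal sketch, assuming a hypothetical Unity Catalog table named `main.master.customers`:

```sql
-- Stop writers from creating new deletion vectors on this table.
-- delta.enableDeletionVectors is a documented Delta table property.
ALTER TABLE main.master.customers
  SET TBLPROPERTIES ('delta.enableDeletionVectors' = false);

-- Rewrite files so existing deletion vectors are materialized (purged).
REORG TABLE main.master.customers APPLY (PURGE);

-- After the retention window passes, remove the now-unreferenced files.
VACUUM main.master.customers;
```

With deletion vectors disabled, deletes and updates rewrite data files eagerly instead of deferring the work to readers, which trades some write performance for more predictable read performance.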

