Optimizing Spark Read Performance on Delta Tables with Deletion Vectors Enabled
Friday
Hi Databricks Experts,
I'm currently using Delta Live Tables to generate master data managed in Unity Catalog, with the data stored directly in Google Cloud Storage. I then use Spark to read this master data from the GCS bucket. However, I'm facing a significant slowdown in Spark processing.
After some investigation, I suspect the root cause might be related to deletion vectors. Here's what I've tried so far to optimize the tables:
- `OPTIMIZE delta.path FULL;`
- `REORG TABLE delta.path APPLY(PURGE);`
- `VACUUM delta.path RETAIN 2 HOURS;`
Despite these efforts, the performance improvements have been minimal. I'm particularly puzzled as to why the `OPTIMIZE` command doesn't seem to be having any meaningful effect.
My questions are:
1. **How can I optimize Spark read performance on Delta tables when deletion vectors are enabled?**
2. **Why might the `OPTIMIZE` command not improve performance as expected in this scenario?**
3. **Are there any alternative strategies or best practices to mitigate the performance impact caused by deletion vectors in Delta tables?**
I appreciate any insights or suggestions you might have. Thanks in advance for your help.
Hung Nguyen
- Labels: Delta Lake, Spark, Workflows
Friday
Hi Hung,
The performance issues you're experiencing with Delta tables and deletion vectors are common challenges when working with Delta Live Tables. Let me address your questions:
How to Optimize Spark Read Performance with Deletion Vectors
Deletion vectors can significantly impact read performance, especially when they accumulate over time. Here's how to optimize:
1. Strategic OPTIMIZE Scheduling: Run OPTIMIZE operations after significant write/delete operations rather than on a fixed schedule.
2. Proper VACUUM Implementation: Ensure your VACUUM operations are actually removing the physical files by checking the retention period against your table history.
3. Monitor Deletion Vector Accumulation: Use `DESCRIBE HISTORY` on the table to track how many deletion vectors are being created and applied.
4. Consider Disabling Deletion Vectors: If your workload is read-heavy with infrequent writes, consider disabling deletion vectors for those specific tables (see the sketch just after this list).
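A minimal sketch of point 4, assuming a hypothetical Unity Catalog table named `main.master_data.customers` (substitute your own table identifier):

```sql
-- Hypothetical table name; replace with your own table identifier.
-- Prevent new deletion vectors from being written to this table.
ALTER TABLE main.master_data.customers
  SET TBLPROPERTIES ('delta.enableDeletionVectors' = false);
```

Note that this only affects future writes; deletion vectors already on disk still have to be purged with `REORG TABLE ... APPLY (PURGE)` followed by `VACUUM`, as sketched further down.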
Why OPTIMIZE May Not Improve Performance
Your OPTIMIZE commands might not be improving performance for several reasons (a way to check which one applies is sketched after this list):
1. Deletion Vectors Still Exist: OPTIMIZE alone doesn't remove deletion vectors completely; it applies them while writing new, compacted data files, but the deletion vector files themselves may remain until VACUUM is run.
2. Incomplete Application: OPTIMIZE might not rewrite all files with deletion vectors, especially if they don't meet compaction criteria.
3. Read Overhead Remains: After OPTIMIZE, if deletion vectors aren't fully purged, readers still need to process them, causing overhead.
4. Timing Issues: The performance benefits of OPTIMIZE might be negated if new deletion vectors are created shortly after optimization.
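One way to check which of these applies is to inspect the table history. A minimal sketch, again with a hypothetical table name:

```sql
-- Hypothetical table name; show the most recent commits and their metrics.
DESCRIBE HISTORY main.master_data.customers LIMIT 10;
```

In the output, the `operation` and `operationMetrics` columns indicate whether DELETE/UPDATE/MERGE commits are adding deletion vectors and how much OPTIMIZE or REORG actually rewrote (the exact metric names vary across Delta and Databricks Runtime versions). If new deletion vectors appear shortly after each OPTIMIZE run, you are hitting the timing issue from point 4.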
Alternative Strategies to Mitigate Performance Impact
1. Force Hard Deletes for Critical Tables: For tables where read performance is critical, implement a workflow that ensures hard deletes rather than relying on deletion vectors (a complete sequence is sketched after this list):
- Run DELETE operations followed by OPTIMIZE
- Use REORG TABLE with APPLY (PURGE) explicitly
- Follow with VACUUM using appropriate retention periods
2. Partition/Clustering Optimization: Ensure your tables are laid out so that the operations generating deletion vectors touch as few files as possible. Traditional Hive-style partitioning is not recommended here; Delta Liquid Clustering is the way to go (also shown in the sketch after this list)!
3. Batch Updates: Consolidate your update/delete operations to minimize the frequency of deletion vector creation.
4. Selective Deletion Vector Usage: Consider a hybrid approach where deletion vectors are only enabled for specific tables or partitions based on their access patterns.
5. Read-Optimized Copies: For critical read workloads, consider maintaining read-optimized copies of tables where deletion vectors are regularly purged.
6. Upgrade Runtime: Ensure you're using Databricks Runtime 14.3 LTS or above, which includes optimizations for deletion vectors.
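Points 1 and 2 could look roughly like the following sketch; the table name and the columns `is_obsolete`, `country`, and `customer_id` are hypothetical placeholders:

```sql
-- 1) Hard-delete workflow: delete, rewrite files to drop deletion vectors, then clean up.
DELETE FROM main.master_data.customers WHERE is_obsolete = true;
REORG TABLE main.master_data.customers APPLY (PURGE);
VACUUM main.master_data.customers RETAIN 168 HOURS;  -- 7-day default retention

-- 2) Liquid clustering on commonly filtered columns instead of Hive-style partitioning.
ALTER TABLE main.master_data.customers CLUSTER BY (country, customer_id);
OPTIMIZE main.master_data.customers FULL;            -- recluster existing data, not only new writes
```

Keep in mind that `ALTER TABLE ... CLUSTER BY` and `OPTIMIZE ... FULL` depend on your Databricks Runtime version, so verify them against the runtime that runs your maintenance jobs.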
Remember that deletion vectors trade faster writes for potentially slower reads. If your workload is read-heavy, you may need to be more aggressive with your optimization strategy or reconsider whether deletion vectors are appropriate for your use case.
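Before changing anything, it can also help to confirm how a table is currently configured; a quick check, again with a hypothetical table name:

```sql
-- Returns the current value of the deletion-vector table property, if set.
SHOW TBLPROPERTIES main.master_data.customers ('delta.enableDeletionVectors');
```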
Friday
Hi @BigRoux ,
Thank you very much for your detailed and easy-to-understand explanation; it was incredibly helpful in addressing the issue. Your guidance has been a major asset in my troubleshooting process.
However, I have one further question that I hope you can shed some light on, and I'd like to provide some more context. I'm currently using the Play Framework in conjunction with Spark 3.3.2 and Delta Lake 2.3 to build an API that reads data directly from Google Cloud Storage. I've compared the performance across three different scenarios:
1. **Spark on Databricks Runtime (v16):** Using clustering on the same source, the performance is excellent: approximately 7 seconds.
2. **Google Big Lake:** Reading from the same source also yields good performance, around 6-7 seconds.
3. **Self-hosted Spark on Play Server (Spark 3.3.2, Delta 2.3):** The performance is extremely slow: around 2 minutes.
All three methods share the same network topology and read from the same data source. Given these conditions, why is there such a large discrepancy in performance between the self-hosted Spark setup and the other two environments?
Do you have any suggestions or insights into why this might be happening? This issue is really proving to be a challenging puzzle!
Thanks again for your help.
Hung Nguyen
Monday
The reason Spark is so much faster on Databricks is that it is a managed service. We have full control over the integration across storage, memory management, and networking, to name a few areas. Spark running on Databricks can be anywhere between 5x and 100x faster, depending on the workload.
Monday
Hi @BigRoux, thanks for your explanation.
If we can't make self-hosted Spark as fast as Databricks, do you have any suggestions for optimizing performance in this scenario?
Hung Nguyen

