Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Optimizing Spark Read Performance on Delta Tables with Deletion Vectors Enabled

oliwia823
New Contributor

Hi Databricks Experts,

I'm currently using Delta Live Tables to generate master data managed within Unity Catalog, with the data stored directly in Google Cloud Storage. I then use Spark to read this master data from the GCS bucket. However, Spark processing has slowed down significantly.

After some investigation, I suspect that the root cause might be related to the deletion vectors. Here’s what I've tried so far to optimize the tables:

- `OPTIMIZE delta.path FULL;`

- `REORG TABLE delta.path APPLY(PURGE);`

- `VACUUM delta.path RETAIN 2 HOURS;`

Despite these efforts, the performance improvements have been minimal. I'm particularly puzzled as to why the `OPTIMIZE` command doesn’t seem to be having any meaningful effect.
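
For reference, here is a minimal sketch of how the three maintenance commands above could be issued from PySpark against a path-based table. The helper names (`quote_delta_path`, `maintenance_statements`, `run_maintenance`) and the example bucket path are hypothetical, and `spark` is assumed to be an existing `SparkSession`. It also assumes the path is written in the backtick-quoted form `` delta.`<path>` ``, and that the retention safety check (`spark.databricks.delta.retentionDurationCheck.enabled`) has been disabled, since a `VACUUM` with a 2-hour retention is below the default threshold.

```python
def quote_delta_path(path: str) -> str:
    """Return a path-based Delta table reference, e.g. delta.`gs://bucket/tbl`."""
    # Backticks inside a quoted identifier are escaped by doubling them.
    return "delta.`" + path.replace("`", "``") + "`"


def maintenance_statements(path: str, retain_hours: int = 2) -> list[str]:
    """Build the OPTIMIZE / REORG / VACUUM statements listed in the post."""
    table = quote_delta_path(path)
    return [
        f"OPTIMIZE {table} FULL",
        f"REORG TABLE {table} APPLY (PURGE)",
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]


def run_maintenance(spark, path: str, retain_hours: int = 2) -> None:
    """Run each statement in order; `spark` is an existing SparkSession (assumed)."""
    for stmt in maintenance_statements(path, retain_hours):
        spark.sql(stmt)


if __name__ == "__main__":
    # Hypothetical path, used only to show the generated SQL.
    for stmt in maintenance_statements("gs://example-bucket/master_data"):
        print(stmt)
```

Running the script only prints the generated SQL; calling `run_maintenance(spark, "<path>")` in a notebook would execute it.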

My questions are:

1. **How can I optimize Spark read performance on Delta tables when deletion vectors are enabled?**

2. **Why might the `OPTIMIZE` command not improve performance as expected in this scenario?**

3. **Are there any alternative strategies or best practices to mitigate the performance impact caused by deletion vectors in Delta tables?**

I appreciate any insights or suggestions you might have. Thanks in advance for your help.

Regards,

1 REPLY

ashraf1395
Honored Contributor
