<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Optimizing Spark Read Performance on Delta Tables with Deletion Vectors Enabled in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/optimizing-spark-read-performance-on-delta-tables-with-deletion/m-p/114578#M44874</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/34815"&gt;@Louis_Frolio&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thank you very much for your detailed and easy-to-understand explanation. It was incredibly helpful in addressing the issue, and your guidance has been a major asset in my troubleshooting process.&lt;/P&gt;&lt;P&gt;However, I have one further question that I hope you can shed some light on, so let me provide more context. I’m currently using the &lt;STRONG&gt;Play Framework&lt;/STRONG&gt; in conjunction with &lt;STRONG&gt;Spark 3.3.2&lt;/STRONG&gt; and &lt;STRONG&gt;Delta 2.3&lt;/STRONG&gt; to build an API that reads data directly from Google Cloud Storage. I’ve compared the performance across three different scenarios:&lt;/P&gt;&lt;P&gt;1. &lt;STRONG&gt;Spark on Databricks Runtime (v16):&lt;/STRONG&gt; Using clustering on the same source, the performance is excellent: approximately 7 seconds.&lt;/P&gt;&lt;P&gt;2. &lt;STRONG&gt;Google BigLake:&lt;/STRONG&gt; Reading from the same source also yields good performance, around 6-7 seconds.&lt;/P&gt;&lt;P&gt;3. &lt;STRONG&gt;Self-hosted Spark on Play Server (Spark 3.3.2, Delta 2.3):&lt;/STRONG&gt; The performance is &lt;STRONG&gt;extremely slow&lt;/STRONG&gt;, around &lt;STRONG&gt;2 minutes&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;All three methods share the same network topology and read from the same data source. Given these conditions, why is there such a large discrepancy in performance between the self-hosted Spark setup and the other two environments?&lt;/P&gt;&lt;P&gt;Do you have any suggestions or insights into why this might be happening? This issue is really proving to be a challenging puzzle!&lt;/P&gt;&lt;P&gt;Thanks again for your help.&lt;/P&gt;</description>
    <pubDate>Sat, 05 Apr 2025 01:55:26 GMT</pubDate>
    <dc:creator>minhhung0507</dc:creator>
    <dc:date>2025-04-05T01:55:26Z</dc:date>
    <item>
      <title>Optimizing Spark Read Performance on Delta Tables with Deletion Vectors Enabled</title>
      <link>https://community.databricks.com/t5/data-engineering/optimizing-spark-read-performance-on-delta-tables-with-deletion/m-p/114523#M44854</link>
      <description>&lt;P&gt;Hi Databricks Experts,&lt;/P&gt;&lt;P&gt;I'm currently using Delta Live Tables to generate master data managed within Unity Catalog, with the data stored directly in Google Cloud Storage. I then use Spark to read this master data from the GCS bucket. However, I’m facing a significant slowdown in Spark processing.&lt;/P&gt;&lt;P&gt;After some investigation, I suspect that the root cause might be related to the deletion vectors. Here’s what I've tried so far to optimize the tables:&lt;/P&gt;&lt;P&gt;- `OPTIMIZE delta.path FULL;`&lt;/P&gt;&lt;P&gt;- `REORG TABLE delta.path APPLY(PURGE);`&lt;/P&gt;&lt;P&gt;- `VACUUM delta.path RETAIN 2 HOURS;`&lt;/P&gt;&lt;P&gt;Despite these efforts, the performance improvements have been minimal. I'm particularly puzzled as to why the `OPTIMIZE` command doesn’t seem to have any meaningful effect.&lt;/P&gt;&lt;P&gt;My questions are:&lt;/P&gt;&lt;P&gt;1. &lt;STRONG&gt;How can I optimize Spark read performance on Delta tables when deletion vectors are enabled?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;2. &lt;STRONG&gt;Why might the `OPTIMIZE` command not improve performance as expected in this scenario?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;3. &lt;STRONG&gt;Are there any alternative strategies or best practices to mitigate the performance impact caused by deletion vectors in Delta tables?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I appreciate any insights or suggestions you might have. Thanks in advance for your help.&lt;/P&gt;</description>
      <pubDate>Fri, 04 Apr 2025 13:33:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimizing-spark-read-performance-on-delta-tables-with-deletion/m-p/114523#M44854</guid>
      <dc:creator>minhhung0507</dc:creator>
      <dc:date>2025-04-04T13:33:24Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing Spark Read Performance on Delta Tables with Deletion Vectors Enabled</title>
      <link>https://community.databricks.com/t5/data-engineering/optimizing-spark-read-performance-on-delta-tables-with-deletion/m-p/114559#M44869</link>
      <description>&lt;P&gt;Hi Hung,&lt;/P&gt;
&lt;P&gt;The performance issues you're experiencing with Delta tables and deletion vectors are common challenges when working with Delta Live Tables. Let me address your questions:&lt;/P&gt;
&lt;P&gt;How to Optimize Spark Read Performance with Deletion Vectors&lt;/P&gt;
&lt;P&gt;Deletion vectors can significantly impact read performance, especially when they accumulate over time. Here's how to optimize:&lt;/P&gt;
&lt;P&gt;1. Strategic OPTIMIZE Scheduling: Run OPTIMIZE operations after significant write/delete operations rather than on a fixed schedule.&lt;/P&gt;
&lt;P&gt;2. Proper VACUUM Implementation: Ensure your VACUUM operations are actually removing the physical files by checking the retention period against your table history.&lt;/P&gt;
&lt;P&gt;3. Monitor Deletion Vector Accumulation: Run `DESCRIBE HISTORY` on the table and review each commit's operation metrics to track how many deletion vectors are being created and applied.&lt;/P&gt;
&lt;P&gt;4. Consider Disabling Deletion Vectors: If your workload is read-heavy with infrequent writes, consider disabling deletion vectors for those specific tables.&lt;/P&gt;
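&lt;P&gt;One way to act on points 3 and 4 is sketched below in Python. It only builds the SQL text for reviewing table history and for turning off deletion vectors; it does not connect to a cluster. The helper names and the bucket path are hypothetical, and the sketch assumes you run the resulting statements yourself on a runtime that supports `DESCRIBE HISTORY` and the `delta.enableDeletionVectors` table property.&lt;/P&gt;

```python
# Sketch only: builds SQL strings; it does not talk to a cluster by itself.
# The bucket path used below is a placeholder, not a value from this thread.


def quoted_path_table(path):
    """Return a delta.`path` identifier for a path-based Delta table."""
    return "delta.`{}`".format(path)


def history_statement(table, limit=20):
    """DESCRIBE HISTORY lists recent commits with their operation metrics."""
    return "DESCRIBE HISTORY {} LIMIT {}".format(table, int(limit))


def disable_deletion_vectors_statement(table):
    """Stop new deletion vectors from being written to the table."""
    return (
        "ALTER TABLE {} SET TBLPROPERTIES "
        "('delta.enableDeletionVectors' = false)".format(table)
    )


if __name__ == "__main__":
    table = quoted_path_table("gs://example-bucket/master_data")  # hypothetical
    print(history_statement(table))
    print(disable_deletion_vectors_statement(table))
```

&lt;P&gt;Turning the property off only stops new deletion vectors from being written. Existing ones remain until the affected files are rewritten, for example with `REORG TABLE ... APPLY (PURGE)`.&lt;/P&gt;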
&lt;P&gt;Why OPTIMIZE May Not Improve Performance&lt;/P&gt;
&lt;P&gt;Your OPTIMIZE commands might not be improving performance for several reasons:&lt;/P&gt;
&lt;P&gt;1. Deletion Vectors Still Exist: OPTIMIZE alone doesn't remove deletion vectors completely - it applies them to create new data files but the deletion vector files themselves may still exist until VACUUM is run.&lt;/P&gt;
&lt;P&gt;2. Incomplete Application: OPTIMIZE might not rewrite all files with deletion vectors, especially if they don't meet compaction criteria.&lt;/P&gt;
&lt;P&gt;3. Read Overhead Remains: After OPTIMIZE, if deletion vectors aren't fully purged, readers still need to process them, causing overhead.&lt;/P&gt;
&lt;P&gt;4. Timing Issues: The performance benefits of OPTIMIZE might be negated if new deletion vectors are created shortly after optimization.&lt;/P&gt;
&lt;P&gt;Alternative Strategies to Mitigate Performance Impact&lt;/P&gt;
&lt;P&gt;1. Force Hard Deletes for Critical Tables: For tables where read performance is critical, implement a workflow that ensures hard deletes rather than relying on deletion vectors:&lt;BR /&gt;- Run DELETE operations followed by OPTIMIZE&lt;BR /&gt;- Use REORG TABLE with APPLY (PURGE) explicitly&lt;BR /&gt;- Follow with VACUUM using appropriate retention periods&lt;/P&gt;
&lt;P&gt;2. Data Layout Optimization: Lay out your tables so that updates and deletes touch as few files as possible, which limits the scope of operations that generate deletion vectors. Traditional Hive-style partitioning is not recommended for this; Delta Liquid Clustering is the way to go!&lt;/P&gt;
&lt;P&gt;3. Batch Updates: Consolidate your update/delete operations to minimize the frequency of deletion vector creation.&lt;/P&gt;
&lt;P&gt;4. Selective Deletion Vector Usage: Consider a hybrid approach where deletion vectors are only enabled for specific tables based on their access patterns. Deletion vectors are controlled by a table property, so the choice is made per table rather than per partition.&lt;/P&gt;
&lt;P&gt;5. Read-Optimized Copies: For critical read workloads, consider maintaining read-optimized copies of tables where deletion vectors are regularly purged.&lt;/P&gt;
&lt;P&gt;6. Upgrade Runtime: Ensure you're using Databricks Runtime 14.3 LTS or above, which includes optimizations for deletion vectors.&lt;/P&gt;
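&lt;P&gt;The hard-delete workflow from point 1 can be sketched as an ordered list of statements. This is a minimal Python illustration, not a definitive implementation: the function names, the table name, and the delete predicate are hypothetical, and `run_hard_delete` assumes an existing SparkSession that already has Delta configured. The default of 168 hours is an assumption chosen for the sketch; pick a retention period that fits your table history.&lt;/P&gt;

```python
# Sketch only: returns the statements in the order the workflow describes.
# Table name and predicate are placeholders, not values from this thread.


def hard_delete_statements(table, predicate, retain_hours=168):
    """Build DELETE, OPTIMIZE, REORG ... APPLY (PURGE), VACUUM in that order."""
    return [
        "DELETE FROM {} WHERE {}".format(table, predicate),
        "OPTIMIZE {}".format(table),
        "REORG TABLE {} APPLY (PURGE)".format(table),
        "VACUUM {} RETAIN {} HOURS".format(table, int(retain_hours)),
    ]


def run_hard_delete(spark, table, predicate, retain_hours=168):
    """Execute each statement in order with an existing SparkSession."""
    for statement in hard_delete_statements(table, predicate, retain_hours):
        spark.sql(statement)


if __name__ == "__main__":
    # Print the plan instead of executing it; no cluster is needed for this.
    for line in hard_delete_statements("main.sales.customers", "is_deleted = true"):
        print(line)
```

&lt;P&gt;Keeping statement construction separate from execution makes it easy to review the plan before anything is rewritten or removed from storage.&lt;/P&gt;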
&lt;P&gt;Remember that deletion vectors trade faster writes for potentially slower reads. If your workload is read-heavy, you may need to be more aggressive with your optimization strategy or reconsider whether deletion vectors are appropriate for your use case.&lt;/P&gt;</description>
      <pubDate>Fri, 04 Apr 2025 19:38:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimizing-spark-read-performance-on-delta-tables-with-deletion/m-p/114559#M44869</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-04-04T19:38:11Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing Spark Read Performance on Delta Tables with Deletion Vectors Enabled</title>
      <link>https://community.databricks.com/t5/data-engineering/optimizing-spark-read-performance-on-delta-tables-with-deletion/m-p/114578#M44874</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/34815"&gt;@Louis_Frolio&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thank you very much for your detailed and easy-to-understand explanation. It was incredibly helpful in addressing the issue, and your guidance has been a major asset in my troubleshooting process.&lt;/P&gt;&lt;P&gt;However, I have one further question that I hope you can shed some light on, so let me provide more context. I’m currently using the &lt;STRONG&gt;Play Framework&lt;/STRONG&gt; in conjunction with &lt;STRONG&gt;Spark 3.3.2&lt;/STRONG&gt; and &lt;STRONG&gt;Delta 2.3&lt;/STRONG&gt; to build an API that reads data directly from Google Cloud Storage. I’ve compared the performance across three different scenarios:&lt;/P&gt;&lt;P&gt;1. &lt;STRONG&gt;Spark on Databricks Runtime (v16):&lt;/STRONG&gt; Using clustering on the same source, the performance is excellent: approximately 7 seconds.&lt;/P&gt;&lt;P&gt;2. &lt;STRONG&gt;Google BigLake:&lt;/STRONG&gt; Reading from the same source also yields good performance, around 6-7 seconds.&lt;/P&gt;&lt;P&gt;3. &lt;STRONG&gt;Self-hosted Spark on Play Server (Spark 3.3.2, Delta 2.3):&lt;/STRONG&gt; The performance is &lt;STRONG&gt;extremely slow&lt;/STRONG&gt;, around &lt;STRONG&gt;2 minutes&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;All three methods share the same network topology and read from the same data source. Given these conditions, why is there such a large discrepancy in performance between the self-hosted Spark setup and the other two environments?&lt;/P&gt;&lt;P&gt;Do you have any suggestions or insights into why this might be happening? This issue is really proving to be a challenging puzzle!&lt;/P&gt;&lt;P&gt;Thanks again for your help.&lt;/P&gt;</description>
      <pubDate>Sat, 05 Apr 2025 01:55:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimizing-spark-read-performance-on-delta-tables-with-deletion/m-p/114578#M44874</guid>
      <dc:creator>minhhung0507</dc:creator>
      <dc:date>2025-04-05T01:55:26Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing Spark Read Performance on Delta Tables with Deletion Vectors Enabled</title>
      <link>https://community.databricks.com/t5/data-engineering/optimizing-spark-read-performance-on-delta-tables-with-deletion/m-p/114717#M44917</link>
      <description>&lt;P&gt;Spark is so much faster on Databricks because it is a managed service. We have full control over the integration across storage, memory management, and networking, to name a few. Spark running on Databricks is anywhere from 5x to 100x faster, depending on the workload.&lt;/P&gt;</description>
      <pubDate>Mon, 07 Apr 2025 12:52:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimizing-spark-read-performance-on-delta-tables-with-deletion/m-p/114717#M44917</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-04-07T12:52:12Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing Spark Read Performance on Delta Tables with Deletion Vectors Enabled</title>
      <link>https://community.databricks.com/t5/data-engineering/optimizing-spark-read-performance-on-delta-tables-with-deletion/m-p/114770#M44938</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/34815"&gt;@Louis_Frolio&lt;/a&gt;, thanks for your explanation.&lt;/P&gt;&lt;P&gt;If we can't get Spark running locally to be as fast as Databricks, do you have any suggestions for optimizing performance in this scenario?&lt;/P&gt;</description>
      <pubDate>Tue, 08 Apr 2025 04:18:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimizing-spark-read-performance-on-delta-tables-with-deletion/m-p/114770#M44938</guid>
      <dc:creator>minhhung0507</dc:creator>
      <dc:date>2025-04-08T04:18:59Z</dc:date>
    </item>
  </channel>
</rss>

