<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Getting concurrent issue on delta table using  liquid clustering in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/getting-concurrent-issue-on-delta-table-using-liquid-clustering/m-p/121454#M46456</link>
    <description>&lt;P&gt;&lt;SPAN&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/88823"&gt;@Walter_C&lt;/a&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;We are using Liquid Clustering as our first strategy. Our Databricks Runtime is 13.3, and we have a table named status_update containing approximately 30 market IDs, each with a single record. In our pipeline, if any market fails, we need to update the status of the table to 'failed'. We are updating the status in parallel, but we are encountering concurrency issues. Does this mean that Liquid Clustering does not work effectively when we use the UPDATE statement on that table?&lt;/P&gt;</description>
    <pubDate>Wed, 11 Jun 2025 11:02:10 GMT</pubDate>
    <dc:creator>Anand13</dc:creator>
    <dc:date>2025-06-11T11:02:10Z</dc:date>
    <item>
      <title>Getting concurrent issue on delta table using  liquid clustering</title>
      <link>https://community.databricks.com/t5/data-engineering/getting-concurrent-issue-on-delta-table-using-liquid-clustering/m-p/120712#M46237</link>
      <description>&lt;P&gt;In our project, we are testing liquid clustering using a test table called status_update, where we need to update the status for different market IDs. We are attempting to update the status_update table in parallel using the UPDATE command.&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;CODE&gt;ALTER TABLE status_update CLUSTER BY (mkt_id)&lt;/CODE&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;CODE&gt;spark.sql(f"UPDATE status_update SET status='{status}' WHERE mkt_id={mkt_id}")&lt;/CODE&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;However, when running the notebook in parallel for different market IDs, we encounter a concurrency issue.&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jun 2025 12:12:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/getting-concurrent-issue-on-delta-table-using-liquid-clustering/m-p/120712#M46237</guid>
      <dc:creator>Anand13</dc:creator>
      <dc:date>2025-06-02T12:12:32Z</dc:date>
    </item>
    <item>
      <title>Re: Getting concurrent issue on delta table using  liquid clustering</title>
      <link>https://community.databricks.com/t5/data-engineering/getting-concurrent-issue-on-delta-table-using-liquid-clustering/m-p/120734#M46244</link>
      <description>&lt;P class="_1t7bu9h1 paragraph"&gt;To address concurrency issues and optimize parallel updates in Spark SQL using the &lt;CODE&gt;UPDATE&lt;/CODE&gt; command on the &lt;CODE&gt;status_update&lt;/CODE&gt; table, consider the following strategies:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Row-Level Logical Conflict Resolution&lt;/STRONG&gt;: Tables with liquid clustering enabled in Databricks Runtime 13.3 and above support row-level concurrency. This minimizes transaction and clustering conflicts during operations like &lt;CODE&gt;UPDATE&lt;/CODE&gt;, &lt;CODE&gt;MERGE&lt;/CODE&gt;, and &lt;CODE&gt;DELETE&lt;/CODE&gt;. For such tables, enabling row tracking can improve performance and reduce write conflicts related to row-level concurrency.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Mitigate Concurrent Updates&lt;/STRONG&gt;: Concurrent updates can fail because Delta Lake uses optimistic concurrency control: a transaction is rejected at commit time if another writer has committed a conflicting change since the transaction read the table. To reduce such failures, avoid running &lt;CODE&gt;UPDATE&lt;/CODE&gt; commands concurrently. Alternatively, stage the updates in an intermediate table and apply them to the target in a single operation, which keeps the change atomic and prevents queries from scanning the target table multiple times during the update.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Optimize Liquid Clustering&lt;/STRONG&gt;: Liquid clustering improves query performance by organizing data according to the table's clustering keys (here, &lt;CODE&gt;mkt_id&lt;/CODE&gt;), which should match the columns most often used in query filters. Clustering is incremental, which avoids unnecessary write amplification. Scheduling regular &lt;CODE&gt;OPTIMIZE&lt;/CODE&gt; jobs (e.g., every one or two hours) can further improve the data layout for better performance during concurrent operations.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Cluster Configuration and Schema Settings&lt;/STRONG&gt;: Use configurations such as:&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;&lt;CODE&gt;spark.databricks.delta.merge.enableLowShuffle&lt;/CODE&gt;: A Spark configuration that enables low shuffle merge, which retains the existing data organization of unmodified rows while efficiently updating and merging.&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;delta.enableDeletionVectors&lt;/CODE&gt;: A table property that enables deletion vectors, which row-level concurrency relies on, reducing write conflicts between concurrent updates.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Avoid Table Scans&lt;/STRONG&gt;: To minimize concurrency issues, reduce the number of table scans during operations such as &lt;CODE&gt;INSERT&lt;/CODE&gt; or &lt;CODE&gt;UPDATE&lt;/CODE&gt;. Persist intermediate data separately to take the load off the target table and reduce read/write conflicts.&lt;/P&gt;
&lt;/LI&gt;
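&lt;P&gt;As a rough illustration of strategies 2 and 5 above, one option is to collect the failed market IDs first and apply them with a single UPDATE instead of issuing one UPDATE per parallel notebook run. This is only a sketch under assumptions: the helper name build_batched_update is hypothetical, mkt_id is assumed to be an integer column, and the table and column names are taken from the question. Whether a single writer fits the pipeline is a design choice, not something confirmed in this thread.&lt;/P&gt;

```python
from typing import Iterable


def build_batched_update(table: str, status: str, mkt_ids: Iterable[int]) -> str:
    """Build one UPDATE statement covering every given market ID.

    Hypothetical helper: it only builds the SQL text. Running it is left
    to the caller (for example via spark.sql in a Databricks notebook).
    """
    # De-duplicate and sort so the generated statement is deterministic.
    ids = sorted({int(m) for m in mkt_ids})
    if not ids:
        raise ValueError("no market IDs to update")
    # Reject quotes rather than attempt escaping in this small sketch.
    if "'" in status:
        raise ValueError("status must not contain a single quote")
    id_list = ", ".join(str(i) for i in ids)
    return f"UPDATE {table} SET status = '{status}' WHERE mkt_id IN ({id_list})"


# Example: three markets failed, one of them reported twice.
statement = build_batched_update("status_update", "failed", [12, 7, 12, 30])
print(statement)
```

&lt;P&gt;In a notebook, the resulting string could be passed to spark.sql(...) once, after all parallel market runs have reported their outcome, so that only one transaction writes to the table.&lt;/P&gt;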
&lt;/OL&gt;</description>
      <pubDate>Mon, 02 Jun 2025 16:38:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/getting-concurrent-issue-on-delta-table-using-liquid-clustering/m-p/120734#M46244</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2025-06-02T16:38:08Z</dc:date>
    </item>
    <item>
      <title>Re: Getting concurrent issue on delta table using  liquid clustering</title>
      <link>https://community.databricks.com/t5/data-engineering/getting-concurrent-issue-on-delta-table-using-liquid-clustering/m-p/121454#M46456</link>
      <description>&lt;P&gt;&lt;SPAN&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/88823"&gt;@Walter_C&lt;/a&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;We are using Liquid Clustering as our first strategy. Our Databricks Runtime is 13.3, and we have a table named status_update containing approximately 30 market IDs, each with a single record. In our pipeline, if any market fails, we need to update the status of the table to 'failed'. We are updating the status in parallel, but we are encountering concurrency issues. Does this mean that Liquid Clustering does not work effectively when we use the UPDATE statement on that table?&lt;/P&gt;</description>
      <pubDate>Wed, 11 Jun 2025 11:02:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/getting-concurrent-issue-on-delta-table-using-liquid-clustering/m-p/121454#M46456</guid>
      <dc:creator>Anand13</dc:creator>
      <dc:date>2025-06-11T11:02:10Z</dc:date>
    </item>
  </channel>
</rss>

