<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Liquid Clustering With Merge in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-with-merge/m-p/135006#M50253</link>
    <description></description>
    <pubDate>Wed, 15 Oct 2025 14:05:05 GMT</pubDate>
    <dc:creator>ManojkMohan</dc:creator>
    <dc:date>2025-10-15T14:05:05Z</dc:date>
    <item>
      <title>Liquid Clustering With Merge</title>
      <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-with-merge/m-p/135004#M50251</link>
      <description>&lt;P&gt;Hello, I’m facing severe performance issues with a MERGE INTO on Databricks.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;merge_condition = """
    source.data_hierarchy = target.data_hierarchy AND
    source.sensor_id = target.sensor_id AND
    source.timestamp = target.timestamp
"""&lt;/LI-CODE&gt;&lt;P&gt;The target Delta table is using &lt;STRONG&gt;Liquid Clustering&lt;/STRONG&gt; on exactly the columns:&lt;BR /&gt;data_hierarchy, sensor_id, timestamp.&lt;/P&gt;&lt;P&gt;Yet, the MERGE operation takes &lt;STRONG&gt;over 40 minutes&lt;/STRONG&gt; (or sometimes more).&lt;BR /&gt;From the Spark logs, it seems like &lt;STRONG&gt;all files are being scanned&lt;/STRONG&gt;:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;MERGE operation - scanning files for matches
… 32 min | 3113/3113 files scanned (~72.2 GiB)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I would expect Liquid Clustering to reduce full-file scans when merge keys align with clustering keys.&lt;/P&gt;&lt;P&gt;My questions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Why would Databricks still scan all files even with Liquid Clustering enabled?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Are there specific configuration tweaks or best practices to speed up MERGE in this setup?&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Thanks a lot for any help or shared experiences!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 15 Oct 2025 13:44:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/liquid-clustering-with-merge/m-p/135004#M50251</guid>
      <dc:creator>Mous92i</dc:creator>
      <dc:date>2025-10-15T13:44:18Z</dc:date>
    </item>
    <item>
      <title>Re: Liquid Clustering With Merge</title>
      <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-with-merge/m-p/135006#M50253</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/191967"&gt;@Mous92i&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Root Cause&lt;/P&gt;&lt;P&gt;Observing the log:&lt;BR /&gt;"MERGE operation, scanning files for matches … 32 min | 3113/3113 files scanned (~72.2 GiB)" shows that every data file in the target is scanned during the merge. This leads to high input/output and long execution times.&lt;/P&gt;&lt;P&gt;Why is this happening despite Liquid Clustering?&lt;BR /&gt;Liquid Clustering only reclusters newly written files a little at a time. Older unoptimized files cause full scans.&lt;/P&gt;&lt;P&gt;Without frequent OPTIMIZE operations, files stay fragmented and are not reorganized by clustering keys. Because of this, Spark cannot effectively prune files for the predicate in MERGE.&lt;/P&gt;&lt;P&gt;Incremental Nature of Liquid Clustering: Liquid Clustering clusters newly written data files gradually. It does not reorganize existing files right away. So, if you don't run OPTIMIZE, files created before enabling Liquid Clustering stay unclustered. This leads to full scans during MERGE.&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/delta/clustering" target="_blank" rel="noopener"&gt;https://docs.databricks.com/aws/en/delta/clustering&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Lack of OPTIMIZE After MERGE: MERGE operations do not automatically trigger full reclustering. Without regular OPTIMIZE to compact and recluster the data, many files remain fragmented and poorly organized along clustering keys. This causes the system to scan all files.&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.youtube.com/watch?v=yZmrpXJg-G8" target="_blank" rel="noopener"&gt;https://www.youtube.com/watch?v=yZmrpXJg-G8&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Solution and Best Practices to Improve MERGE Performance&lt;/P&gt;&lt;P&gt;Schedule Frequent OPTIMIZE Commands: Run OPTIMIZE on your Delta table regularly after MERGE. 
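&lt;/P&gt;&lt;P&gt;A minimal sketch of that routine, not a definitive implementation: it only builds the two statements as strings so it can run anywhere. The table name &lt;CODE&gt;main.iot.sensor_readings&lt;/CODE&gt; and the helper &lt;CODE&gt;optimize_statement&lt;/CODE&gt; are hypothetical, and &lt;CODE&gt;OPTIMIZE ... FULL&lt;/CODE&gt; is assumed to be available on your runtime (it needs a recent Databricks Runtime).&lt;/P&gt;

```python
def optimize_statement(table_name, full=False):
    """Build an OPTIMIZE statement for a liquid-clustered Delta table.

    full=True appends FULL, which asks Databricks to recluster every
    record, including files written before clustering was enabled.
    """
    # Backtick-quote each part of the (hypothetical) dotted table name.
    quoted = ".".join("`" + part + "`" for part in table_name.split("."))
    statement = "OPTIMIZE " + quoted
    if full:
        statement += " FULL"
    return statement


# Hypothetical table name, used only for illustration.
TABLE = "main.iot.sensor_readings"

# Run once after enabling clustering or changing the clustering keys.
one_time = optimize_statement(TABLE, full=True)
# Run on a schedule, for example after each batch of MERGE operations.
recurring = optimize_statement(TABLE)

print(one_time)
print(recurring)

# On Databricks these would be executed with spark.sql(one_time) and
# spark.sql(recurring); here they are only printed.
```

&lt;P&gt;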
OPTIMIZE physically reorganizes files by clustering key and compacts small files, which is what makes data skipping and file pruning effective during MERGE.&lt;/P&gt;&lt;P&gt;&lt;A href="https://dev.to/aj_ankit85/liquid-clustering-optimizing-databricks-workloads-for-performance-and-cost-4aai" target="_blank" rel="noopener"&gt;https://dev.to/aj_ankit85/liquid-clustering-optimizing-databricks-workloads-for-performance-and-cost-4aai&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Leverage Predicate Pushdown: Write your MERGE conditions so that Spark can push filters on the clustering keys down to the target scan. This lets irrelevant files be skipped early instead of scanned.&lt;/P&gt;&lt;P&gt;Enable Photon Runtime: Use the Photon engine in Databricks Runtime 15.2 or later to benefit from faster query execution and improvements in MERGE and clustering performance.&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/delta/clustering" target="_blank" rel="noopener"&gt;https://docs.databricks.com/aws/en/delta/clustering&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Monitor File Size and Skew: Set up auto-compaction and adjust cluster size to reduce the number of small files and balance data distribution for better clustering.&lt;/P&gt;&lt;P&gt;Use Change Data Feed (CDF) for Incremental Updates: Whenever possible, process incremental changes through CDF rather than relying on MERGE statements that scan the full table.&lt;/P&gt;&lt;P&gt;Maintain Table Metadata and History: Regularly check Delta table metadata and transaction logs to confirm the clustering state and ensure OPTIMIZE jobs are running effectively.&lt;/P&gt;</description>
      <pubDate>Wed, 15 Oct 2025 14:05:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/liquid-clustering-with-merge/m-p/135006#M50253</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-10-15T14:05:05Z</dc:date>
    </item>
    <item>
      <title>Re: Liquid Clustering With Merge</title>
      <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-with-merge/m-p/135147#M50287</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/191967"&gt;@Mous92i&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Dynamic file pruning (DFP) is what pushes &lt;EM&gt;source&lt;/EM&gt; filters down to the &lt;EM&gt;target&lt;/EM&gt; to skip files. For &lt;CODE&gt;MERGE/UPDATE/DELETE&lt;/CODE&gt;, DFP only works on &lt;STRONG&gt;Photon-enabled&lt;/STRONG&gt; compute. If you’re not on Photon, MERGE cannot use DFP and will likely scan every file.&lt;BR /&gt;&lt;BR /&gt;Enabling Liquid Clustering doesn’t recluster existing files. Until you run &lt;CODE&gt;OPTIMIZE FULL&lt;/CODE&gt; &lt;STRONG&gt;once&lt;/STRONG&gt; (after enabling clustering or changing keys), old files remain unclustered and don’t prune well. After that, run regular &lt;CODE&gt;OPTIMIZE&lt;/CODE&gt; to keep up.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Oct 2025 15:35:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/liquid-clustering-with-merge/m-p/135147#M50287</guid>
      <dc:creator>K_Anudeep</dc:creator>
      <dc:date>2025-10-16T15:35:32Z</dc:date>
    </item>
    <item>
      <title>Re: Liquid Clustering With Merge</title>
      <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-with-merge/m-p/135247#M50312</link>
      <description>&lt;P&gt;Thanks for your response&lt;/P&gt;</description>
      <pubDate>Fri, 17 Oct 2025 13:52:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/liquid-clustering-with-merge/m-p/135247#M50312</guid>
      <dc:creator>Mous92i</dc:creator>
      <dc:date>2025-10-17T13:52:42Z</dc:date>
    </item>
  </channel>
</rss>

