<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic MERGE operation not performing data skipping with liquid clustering on key columns in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/merge-operation-not-performing-data-skipping-with-liquid/m-p/138555#M50954</link>
    <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hi, I need some help understanding a performance issue.&lt;/P&gt;&lt;P&gt;I have a table that reads approximately 800K records every 30 minutes in an incremental manner.&lt;BR /&gt;Let’s say its primary key is:&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;timestamp, x, y&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;PRE&gt;&amp;nbsp;&lt;/PRE&gt;&lt;P&gt;This table is overwritten every 30 minutes and serves as a BUFFER table holding the current batch of data.&lt;/P&gt;&lt;P&gt;In addition, I have another table that stores all historical runs, with the same primary key.&lt;/P&gt;&lt;P&gt;I’m using a MERGE operation to update existing records (when the key already exists) or insert new ones (when the key is not found).&lt;/P&gt;&lt;P&gt;From my understanding, once Liquid Clustering is defined on the key columns used in the MERGE, the process should be able to perform data skipping, ignoring files that are not relevant in the target table (the historical table).&lt;/P&gt;&lt;P&gt;However, that’s not happening - instead, each run results in a full scan of the target table.&lt;/P&gt;&lt;P&gt;I’ve verified that:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;The BUFFER table contains only the most recent incremental data, meaning it brings in only records that were not included in the previous run.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;The historical table contains data spanning several years.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Despite this, the MERGE operation still performs a full scan, and no data skipping occurs.&lt;/P&gt;&lt;P&gt;Why is that?&lt;BR /&gt;I assume that if I explicitly add a filter in the MERGE condition, for example:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;DIV class=""&gt;and dt &lt;SPAN class=""&gt;&lt;SPAN class=""&gt;&amp;gt;= (min dt &lt;SPAN class=""&gt;from the BUFFER &lt;SPAN class=""&gt;table)&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;PRE&gt;&amp;nbsp;&lt;/PRE&gt;&lt;P&gt;then data skipping would occur - but I’d like to understand whether this should happen automatically, at least theoretically, when Liquid Clustering is defined on the MERGE key columns.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
    <pubDate>Tue, 11 Nov 2025 10:39:56 GMT</pubDate>
    <dc:creator>DatabricksEngi1</dc:creator>
    <dc:date>2025-11-11T10:39:56Z</dc:date>
    <item>
      <title>MERGE operation not performing data skipping with liquid clustering on key columns</title>
      <link>https://community.databricks.com/t5/data-engineering/merge-operation-not-performing-data-skipping-with-liquid/m-p/138555#M50954</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hi, I need some help understanding a performance issue.&lt;/P&gt;&lt;P&gt;I have a table that reads approximately 800K records every 30 minutes in an incremental manner.&lt;BR /&gt;Let’s say its primary key is:&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;timestamp, x, y&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;PRE&gt;&amp;nbsp;&lt;/PRE&gt;&lt;P&gt;This table is overwritten every 30 minutes and serves as a BUFFER table holding the current batch of data.&lt;/P&gt;&lt;P&gt;In addition, I have another table that stores all historical runs, with the same primary key.&lt;/P&gt;&lt;P&gt;I’m using a MERGE operation to update existing records (when the key already exists) or insert new ones (when the key is not found).&lt;/P&gt;&lt;P&gt;From my understanding, once Liquid Clustering is defined on the key columns used in the MERGE, the process should be able to perform data skipping, ignoring files that are not relevant in the target table (the historical table).&lt;/P&gt;&lt;P&gt;However, that’s not happening - instead, each run results in a full scan of the target table.&lt;/P&gt;&lt;P&gt;I’ve verified that:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;The BUFFER table contains only the most recent incremental data, meaning it brings in only records that were not included in the previous run.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;The historical table contains data spanning several years.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Despite this, the MERGE operation still performs a full scan, and no data skipping occurs.&lt;/P&gt;&lt;P&gt;Why is that?&lt;BR /&gt;I assume that if I explicitly add a filter in the MERGE condition, for example:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;DIV class=""&gt;and dt &lt;SPAN class=""&gt;&lt;SPAN class=""&gt;&amp;gt;= (min dt &lt;SPAN class=""&gt;from the BUFFER &lt;SPAN class=""&gt;table)&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;PRE&gt;&amp;nbsp;&lt;/PRE&gt;&lt;P&gt;then data skipping would occur - but I’d like to understand whether this should happen automatically, at least theoretically, when Liquid Clustering is defined on the MERGE key columns.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 11 Nov 2025 10:39:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/merge-operation-not-performing-data-skipping-with-liquid/m-p/138555#M50954</guid>
      <dc:creator>DatabricksEngi1</dc:creator>
      <dc:date>2025-11-11T10:39:56Z</dc:date>
    </item>
    <item>
      <title>Re: MERGE operation not performing data skipping with liquid clustering on key columns</title>
      <link>https://community.databricks.com/t5/data-engineering/merge-operation-not-performing-data-skipping-with-liquid/m-p/138627#M50981</link>
      <description>&lt;P&gt;MERGE is not a pure read plus filter operation&lt;/P&gt;&lt;P&gt;Even though Liquid Clustering organizes your data by key ranges and writes min/max stats, the &lt;STRONG&gt;MERGE engine&lt;/STRONG&gt; has to identify &lt;EM&gt;both matches and non-matches&lt;/EM&gt;.&lt;/P&gt;&lt;P&gt;That means the query planner must:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Scan all candidate clusters that &lt;STRONG&gt;might&lt;/STRONG&gt; contain keys from the incoming batch, &lt;EM&gt;and&lt;/EM&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Verify whether the keys exist or not.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Unless it can safely infer a bounded key range to check, the planner conservatively scans all clusters.&lt;/P&gt;&lt;P&gt;Liquid Clustering tracks &lt;STRONG&gt;min/max statistics per column&lt;/STRONG&gt;, but not multi-column composite keys.&lt;BR /&gt;When you merge on (timestamp, x, y), Delta can only skip files if &lt;STRONG&gt;all three&lt;/STRONG&gt; columns’ ranges are mutually exclusive with the incoming keys.&lt;/P&gt;&lt;P&gt;So even with clustering, you can end up touching most clusters.&lt;/P&gt;&lt;P&gt;MERGE INTO target USING source ON &amp;lt;join condition&amp;gt; does &lt;EM&gt;not&lt;/EM&gt; automatically push filters based on min/max stats of the source.&lt;BR /&gt;Delta cannot assume, for example, “buffer only contains recent timestamps, so skip older clusters,” unless you &lt;STRONG&gt;explicitly&lt;/STRONG&gt; tell it via a predicate.&lt;/P&gt;&lt;P&gt;Hope it helps!&lt;/P&gt;</description>
      <pubDate>Tue, 11 Nov 2025 16:56:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/merge-operation-not-performing-data-skipping-with-liquid/m-p/138627#M50981</guid>
      <dc:creator>bianca_unifeye</dc:creator>
      <dc:date>2025-11-11T16:56:52Z</dc:date>
    </item>
    <item>
      <title>Re: MERGE operation not performing data skipping with liquid clustering on key columns</title>
      <link>https://community.databricks.com/t5/data-engineering/merge-operation-not-performing-data-skipping-with-liquid/m-p/138644#M50987</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Thank you very much for your answer.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Just to make sure I understood correctly -&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;if my ON condition included only timestamp (and not a combination of three columns),&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;would data skipping occur in that case?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thank you very much!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 11 Nov 2025 19:31:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/merge-operation-not-performing-data-skipping-with-liquid/m-p/138644#M50987</guid>
      <dc:creator>DatabricksEngi1</dc:creator>
      <dc:date>2025-11-11T19:31:59Z</dc:date>
    </item>
    <item>
      <title>Re: MERGE operation not performing data skipping with liquid clustering on key columns</title>
      <link>https://community.databricks.com/t5/data-engineering/merge-operation-not-performing-data-skipping-with-liquid/m-p/138765#M51006</link>
      <description>&lt;P&gt;If your &lt;STRONG&gt;MERGE condition&lt;/STRONG&gt; used only one clustering column (for example just timestamp), then &lt;STRONG&gt;yes, data skipping &lt;EM&gt;could&lt;/EM&gt; occur more effectively&lt;/STRONG&gt;, but &lt;STRONG&gt;only if&lt;/STRONG&gt; the following conditions hold true:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Your &lt;STRONG&gt;Liquid Clustering&lt;/STRONG&gt; (or Z-Ordering) is defined on the same column you use in the ON clause&amp;nbsp; e.g. CLUSTER BY (timestamp).&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;The &lt;STRONG&gt;incoming buffer&lt;/STRONG&gt; has a &lt;EM&gt;narrow&lt;/EM&gt; and &lt;EM&gt;contiguous&lt;/EM&gt; range of timestamps (for instance, just the last 30 minutes).&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Delta can therefore compare the buffer’s min(timestamp) – max(timestamp) range to each cluster’s stored range and &lt;STRONG&gt;skip clusters completely outside&lt;/STRONG&gt; that interval.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;In that case, only the clusters overlapping the current batch’s timestamp window would be scanned, this is what we call &lt;STRONG&gt;range-based pruning&lt;/STRONG&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 12 Nov 2025 11:35:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/merge-operation-not-performing-data-skipping-with-liquid/m-p/138765#M51006</guid>
      <dc:creator>bianca_unifeye</dc:creator>
      <dc:date>2025-11-12T11:35:18Z</dc:date>
    </item>
    <item>
      <title>Re: MERGE operation not performing data skipping with liquid clustering on key columns</title>
      <link>https://community.databricks.com/t5/data-engineering/merge-operation-not-performing-data-skipping-with-liquid/m-p/138771#M51010</link>
      <description>&lt;P&gt;Thank you very much!&lt;/P&gt;</description>
      <pubDate>Wed, 12 Nov 2025 12:07:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/merge-operation-not-performing-data-skipping-with-liquid/m-p/138771#M51010</guid>
      <dc:creator>DatabricksEngi1</dc:creator>
      <dc:date>2025-11-12T12:07:34Z</dc:date>
    </item>
  </channel>
</rss>

