cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

MERGE operation not performing data skipping with liquid clustering on key columns

DatabricksEngi1
Contributor

 

Hi, I need some help understanding a performance issue.

I have a table that reads approximately 800K records every 30 minutes in an incremental manner.
Let’s say its primary key is:

timestamp, x, y
 

This table is overwritten every 30 minutes and serves as a BUFFER table holding the current batch of data.

In addition, I have another table that stores all historical runs, with the same primary key.

I’m using a MERGE operation to update existing records (when the key already exists) or insert new ones (when the key is not found).

From my understanding, once Liquid Clustering is defined on the key columns used in the MERGE, the process should be able to perform data skipping, ignoring files that are not relevant in the target table (the historical table).

However, that’s not happening - instead, each run results in a full scan of the target table.

I’ve verified that:

  • The BUFFER table contains only the most recent incremental data, meaning it brings in only records that were not included in the previous run.

  • The historical table contains data spanning several years.

Despite this, the MERGE operation still performs a full scan, and no data skipping occurs.

Why is that?
I assume that if I explicitly add a filter in the MERGE condition, for example:

 

 
and dt >= (min dt from the BUFFER table)
 

then data skipping would occur - but I’d like to understand whether this should happen automatically, at least theoretically, when Liquid Clustering is defined on the MERGE key columns.

 

Thank you!

2 ACCEPTED SOLUTIONS

Accepted Solutions

bianca_unifeye
New Contributor II

MERGE is not a pure read plus filter operation

Even though Liquid Clustering organizes your data by key ranges and writes min/max stats, the MERGE engine has to identify both matches and non-matches.

That means the query planner must:

  • Scan all candidate clusters that might contain keys from the incoming batch, and

  • Verify whether the keys exist or not.

Unless it can safely infer a bounded key range to check, the planner conservatively scans all clusters.

Liquid Clustering tracks min/max statistics per column, but not multi-column composite keys.
When you merge on (timestamp, x, y), Delta can only skip files if all three columns’ ranges are mutually exclusive with the incoming keys.

So even with clustering, you can end up touching most clusters.

MERGE INTO target USING source ON <join condition> does not automatically push filters based on min/max stats of the source.
Delta cannot assume, for example, “buffer only contains recent timestamps, so skip older clusters,” unless you explicitly tell it via a predicate.

Hope it helps!

View solution in original post

If your MERGE condition used only one clustering column (for example just timestamp), then yes, data skipping could occur more effectively, but only if the following conditions hold true:

  • Your Liquid Clustering (or Z-Ordering) is defined on the same column you use in the ON clause  e.g. CLUSTER BY (timestamp).

  • The incoming buffer has a narrow and contiguous range of timestamps (for instance, just the last 30 minutes).

  • Delta can therefore compare the buffer’s min(timestamp) – max(timestamp) range to each cluster’s stored range and skip clusters completely outside that interval.

In that case, only the clusters overlapping the current batch’s timestamp window would be scanned, this is what we call range-based pruning.

View solution in original post

4 REPLIES 4

bianca_unifeye
New Contributor II

MERGE is not a pure read plus filter operation

Even though Liquid Clustering organizes your data by key ranges and writes min/max stats, the MERGE engine has to identify both matches and non-matches.

That means the query planner must:

  • Scan all candidate clusters that might contain keys from the incoming batch, and

  • Verify whether the keys exist or not.

Unless it can safely infer a bounded key range to check, the planner conservatively scans all clusters.

Liquid Clustering tracks min/max statistics per column, but not multi-column composite keys.
When you merge on (timestamp, x, y), Delta can only skip files if all three columns’ ranges are mutually exclusive with the incoming keys.

So even with clustering, you can end up touching most clusters.

MERGE INTO target USING source ON <join condition> does not automatically push filters based on min/max stats of the source.
Delta cannot assume, for example, “buffer only contains recent timestamps, so skip older clusters,” unless you explicitly tell it via a predicate.

Hope it helps!

Thank you very much for your answer.

Just to make sure I understood correctly -

if my ON condition included only timestamp (and not a combination of three columns),

would data skipping occur in that case?

 

Thank you very much!

If your MERGE condition used only one clustering column (for example just timestamp), then yes, data skipping could occur more effectively, but only if the following conditions hold true:

  • Your Liquid Clustering (or Z-Ordering) is defined on the same column you use in the ON clause  e.g. CLUSTER BY (timestamp).

  • The incoming buffer has a narrow and contiguous range of timestamps (for instance, just the last 30 minutes).

  • Delta can therefore compare the buffer’s min(timestamp) – max(timestamp) range to each cluster’s stored range and skip clusters completely outside that interval.

In that case, only the clusters overlapping the current batch’s timestamp window would be scanned, this is what we call range-based pruning.

Thank you very much!