11-11-2025 08:56 AM
MERGE is not a pure read plus filter operation
Even though Liquid Clustering organizes your data by key ranges and writes min/max stats, the MERGE engine has to identify both matches and non-matches.
That means the query planner must:
- Scan all candidate clusters that might contain keys from the incoming batch, and
- Verify whether the keys exist or not.
Unless it can safely infer a bounded key range to check, the planner conservatively scans all clusters.
Liquid Clustering tracks min/max statistics per column, but not multi-column composite keys.
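When you merge on (timestamp, x, y), Delta can skip a file only if its per-column ranges prove that none of the incoming keys can be in it. A file whose timestamp, x, and y ranges each overlap the incoming batch has to be read, even if no row in it matches an actual (timestamp, x, y) combination.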
So even with clustering, you can end up touching most clusters.
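To make the per-column limitation concrete, here is a small Python sketch. It is a toy model and not Delta's actual implementation: `FileStats`, `build_stats`, `may_contain`, `files_to_scan`, and the sample rows are all made up for illustration. Each "file" keeps only a min/max range per column, and a file has to be scanned whenever an incoming key falls inside every one of those ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-column (min, max) ranges for one data file (toy model)."""
    name: str
    ranges: dict  # column name -> (min, max)


def build_stats(name, rows, columns):
    """Collect independent min/max ranges for each column of a file."""
    ranges = {
        col: (min(r[i] for r in rows), max(r[i] for r in rows))
        for i, col in enumerate(columns)
    }
    return FileStats(name, ranges)


def may_contain(stats, key, columns):
    """True if every value of `key` lies inside that column's range.

    The stats cannot rule the file out in this case, so it must be read.
    """
    return all(
        stats.ranges[col][0] <= value <= stats.ranges[col][1]
        for col, value in zip(columns, key)
    )


def files_to_scan(all_stats, keys, columns):
    """Names of files that cannot be skipped for the incoming keys."""
    return [
        s.name
        for s in all_stats
        if any(may_contain(s, k, columns) for k in keys)
    ]


COLUMNS = ("timestamp", "x", "y")
file_a_rows = [(1, 0, 9), (5, 9, 0)]        # wide x and y ranges
file_b_rows = [(100, 0, 0), (105, 9, 9)]    # much later timestamps

stats = [
    build_stats("file_a", file_a_rows, COLUMNS),
    build_stats("file_b", file_b_rows, COLUMNS),
]

incoming = [(3, 5, 5)]  # this exact tuple is in neither file
scan = files_to_scan(stats, incoming, COLUMNS)

# file_a is scanned although it does not hold the key: a false positive
false_positive = "file_a" in scan and incoming[0] not in file_a_rows
print(scan, false_positive)
```

In this toy setup `file_b` is skipped because its timestamp range excludes the incoming key, while `file_a` must be read even though it contains no matching row. The wider each file's x and y ranges are, the more often this happens.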
MERGE INTO target USING source ON <join condition> does not automatically push filters based on min/max stats of the source.
Delta cannot assume, for example, “buffer only contains recent timestamps, so skip older clusters,” unless you explicitly tell it via a predicate.
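One way to tell it explicitly is to add a literal bound on the target's timestamp to the merge condition. The sketch below only builds the SQL text, so it runs without a cluster. The table names `events` and `buffer`, the helper `build_bounded_merge`, and the dates are hypothetical; in practice you would first compute the bounds from the source (for example with a min/max query over the buffer) and then pass the resulting string to `spark.sql`.

```python
from datetime import datetime


def build_bounded_merge(target, source, key_cols, ts_col, ts_min, ts_max):
    """Build a MERGE statement with an explicit timestamp bound on the target.

    The literal BETWEEN bound is what gives the planner a static range
    to compare against the target's file-level min/max statistics.
    Only pass trusted identifiers and values: this is plain string formatting.
    """
    join = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    bound = (
        f"t.{ts_col} BETWEEN '{ts_min.isoformat(sep=' ')}' "
        f"AND '{ts_max.isoformat(sep=' ')}'"
    )
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {join}\n"
        f"   AND {bound}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical bounds, assumed to be the min and max timestamp in the buffer
sql = build_bounded_merge(
    target="events",
    source="buffer",
    key_cols=["timestamp", "x", "y"],
    ts_col="timestamp",
    ts_min=datetime(2025, 11, 10),
    ts_max=datetime(2025, 11, 11),
)
print(sql)
```

The bounds must cover every timestamp in the source batch. If they are too narrow, source rows outside the range will not match existing target rows and will be inserted as duplicates.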
Hope it helps!