MERGE operation not performing data skipping with liquid clustering on key columns

DatabricksEngi1
Contributor

 

Hi, I need some help understanding a performance issue.

I have a table that receives approximately 800K records every 30 minutes, loaded incrementally.
Let’s say its primary key is:

timestamp, x, y
 

This table is overwritten every 30 minutes and serves as a BUFFER table holding the current batch of data.

In addition, I have another table that stores all historical runs, with the same primary key.

I’m using a MERGE operation to update existing records (when the key already exists) or insert new ones (when the key is not found).

From my understanding, once Liquid Clustering is defined on the key columns used in the MERGE, the process should be able to perform data skipping, ignoring files that are not relevant in the target table (the historical table).
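To make that expectation concrete, here is a toy model of file-level skipping based on min/max statistics. It is only a sketch of the general idea, not Delta's actual implementation; the `FileStats` class, the file names, and the date values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics: min and max of one clustering column."""
    path: str
    min_dt: str  # ISO dates compare correctly as plain strings
    max_dt: str


def files_to_scan(files, lower_bound):
    """Keep only files whose value range can contain rows with dt >= lower_bound."""
    return [f.path for f in files if f.max_dt >= lower_bound]


# Made-up layout of a well-clustered historical table.
history = [
    FileStats("part-0", "2021-01-01", "2021-12-31"),
    FileStats("part-1", "2022-01-01", "2023-06-30"),
    FileStats("part-2", "2023-07-01", "2024-05-10"),
]

# With a known lower bound, only the newest file needs to be read.
print(files_to_scan(history, "2024-05-10"))
```

In this model, skipping depends on having a concrete bound to compare against each file's statistics. A plain equality join on the keys does not provide such a bound up front, which may be part of what I am seeing.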

However, that’s not happening - instead, each run results in a full scan of the target table.

I’ve verified that:

  • The BUFFER table contains only the most recent incremental data, meaning it brings in only records that were not included in the previous run.

  • The historical table contains data spanning several years.

Despite this, the MERGE operation still performs a full scan, and no data skipping occurs.

Why is that?
I assume that if I explicitly add a filter in the MERGE condition, for example:

and dt >= (min dt from the BUFFER table)

then data skipping would occur - but I’d like to understand whether this should happen automatically, at least theoretically, when Liquid Clustering is defined on the MERGE key columns.
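A minimal sketch of the explicit-filter workaround, assuming the minimum `dt` of the BUFFER table is computed first and then inlined as a literal. The helper `build_merge_sql`, the table names `history` and `buffer`, and the date value are hypothetical; only the key columns and `dt` come from the description above.

```python
def build_merge_sql(target, source, keys, prune_col, lower_bound):
    """Return a MERGE statement whose ON clause carries a literal lower bound.

    lower_bound is inlined as-is, so it is assumed to be a trusted value
    computed from the source table (for example its minimum dt).
    """
    # Equality match on every key column, quoted in case a name is a keyword.
    key_match = " AND ".join(f"t.`{k}` = s.`{k}`" for k in keys)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {key_match} AND t.`{prune_col}` >= '{lower_bound}'\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical table names and bound, for illustration only.
sql = build_merge_sql("history", "buffer", ["timestamp", "x", "y"], "dt", "2024-05-10")
print(sql)
```

The resulting string could then be passed to `spark.sql(...)`. One caveat with this pattern: the bound has to be no later than the `dt` of any target row the batch can match, otherwise a row that should be updated would be inserted as a duplicate.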

 

Thank you!