topic Re: Liquid Clustering Key Change Question in Get Started Discussions

Liquid Clustering Key Change Question

ShivangiB — Wed, 05 Mar 2025 15:22:54 GMT

If i already have a cluster key1 for existing table, i want to change cluster key to key2 using

ALTER TABLE table CLUSTER BY (key2), then run OPTIMIZE table, based on databrick document , existing files will not be rewritten (verified by my test as well), cluster Key2 will only be applied incrementally. If my existing files are huge (saying having long history), using Key2 will not improve performance when i search the table (e.g. select * from table where key2='xxxx'). is liquid cluster supposed to work this way or i am missing anything?

Re: Liquid Clustering Key Change Question

koji_kawamura — Wed, 12 Mar 2025 02:38:32 GMT

Hi @ShivangiB Yes, Liquid cluster is supposed to work that way. The behavior that only applies new keys to the new data supports the characteristic of "Tables with access patterns that change over time". It allows us to change clustering keys based on query pattern changes without rewriting the historical data.

If the historical data needs to use the new clustering keys, we can use OPTIMIZE FULL, which requires I/O costs and time.

Re: Liquid Clustering Key Change Question

lingareddy_Alva — Wed, 12 Mar 2025 02:58:25 GMT

@ShivangiB

You're correct in your understanding. When you change a clustering key using ALTER TABLE followed by OPTIMIZE, it doesn't automatically recluster existing data. Let me explain why this happens and what options you have.In Delta Lake (which Databricks uses), the ALTER TABLE CLUSTER BY statement only changes the clustering specification for future write operations. The OPTIMIZE command without additional parameters will only compact small files but won't reorder data according to the new clustering key.For your existing data to benefit from the new clustering key (key2), you need to explicitly request reclustering. You have two options:

Run OPTIMIZE with ZORDER BY: OPTIMIZE table ZORDER BY (key2)
Or rewrite the table entirely:

CREATE OR REPLACE TABLE table_temp CLUSTER BY (key2) AS SELECT * FROM table; -- Then rename tables

ALTER TABLE table RENAME TO table_old; ALTER TABLE table_temp RENAME TO table;

The reason your test showed that existing files weren't rewritten is because OPTIMIZE alone primarily focuses on file compaction rather than data reorganization. For large historical datasets, you need to explicitly trigger the reclustering process to see performance improvements for queries filtering on key2.