cancel
Showing results for 
Search instead for 
Did you mean: 
Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.
cancel
Showing results for 
Search instead for 
Did you mean: 

Liquid Clustering Key Change Question

ShivangiB
New Contributor III

If i already have a cluster key1 for existing table, i want to change cluster key to key2 using 

ALTER TABLE table CLUSTER BY (key2), then run OPTIMIZE table, based on databrick document , existing files will not be rewritten (verified by my test as well), cluster Key2 will only be applied incrementally.  If my existing files are huge (saying having long history), using Key2 will not improve performance when i search the table (e.g. select * from table where key2='xxxx'). is liquid cluster supposed to work this way or i am missing anything?
2 REPLIES 2

koji_kawamura
Databricks Employee
Databricks Employee

Hi @ShivangiB Yes, Liquid cluster is supposed to work that way. The behavior that only applies new keys to the new data supports the characteristic of "Tables with access patterns that change over time". It allows us to change clustering keys based on query pattern changes without rewriting the historical data.

If the historical data needs to use the new clustering keys, we can use OPTIMIZE FULL, which requires I/O costs and time.

LRALVA
Contributor II

@ShivangiB 

You're correct in your understanding. When you change a clustering key using ALTER TABLE followed by OPTIMIZE, it doesn't automatically recluster existing data. Let me explain why this happens and what options you have.In Delta Lake (which Databricks uses), the ALTER TABLE CLUSTER BY statement only changes the clustering specification for future write operations. The OPTIMIZE command without additional parameters will only compact small files but won't reorder data according to the new clustering key.For your existing data to benefit from the new clustering key (key2), you need to explicitly request reclustering. You have two options:

  1. Run OPTIMIZE with ZORDER BY:       OPTIMIZE table ZORDER BY (key2)
  2. Or rewrite the table entirely:  

          CREATE OR REPLACE TABLE table_temp CLUSTER BY (key2) AS SELECT * FROM table; -- Then rename              tables

           ALTER TABLE table RENAME TO table_old; ALTER TABLE table_temp RENAME TO table;

The reason your test showed that existing files weren't rewritten is because OPTIMIZE alone primarily focuses on file compaction rather than data reorganization. For large historical datasets, you need to explicitly trigger the reclustering process to see performance improvements for queries filtering on key2.

 

 

 

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now