@ShivangiB
You're correct in your understanding. When you change a clustering key using ALTER TABLE followed by OPTIMIZE, it doesn't automatically recluster existing data. Let me explain why this happens and what options you have.In Delta Lake (which Databricks uses), the ALTER TABLE CLUSTER BY statement only changes the clustering specification for future write operations. The OPTIMIZE command without additional parameters will only compact small files but won't reorder data according to the new clustering key.For your existing data to benefit from the new clustering key (key2), you need to explicitly request reclustering. You have two options:
- Run OPTIMIZE with ZORDER BY: OPTIMIZE table ZORDER BY (key2)
- Or rewrite the table entirely:
CREATE OR REPLACE TABLE table_temp CLUSTER BY (key2) AS SELECT * FROM table; -- Then rename tables
ALTER TABLE table RENAME TO table_old; ALTER TABLE table_temp RENAME TO table;
The reason your test showed that existing files weren't rewritten is because OPTIMIZE alone primarily focuses on file compaction rather than data reorganization. For large historical datasets, you need to explicitly trigger the reclustering process to see performance improvements for queries filtering on key2.