3 weeks ago
Hi everyone,
I'm evaluating whether Liquid Clustering would be beneficial for my tables based on their sizes. Below is the size distribution of tables in my environment:
| Size category | Table count |
| --- | --- |
| Large (> 1 TB) | 3 |
| Medium (10 GB – 1 TB) | 284 |
| Small (< 10 GB) | 17,266 |
Given this distribution, would Liquid Clustering provide significant performance improvements, or is it better suited for environments with more large tables?
Any best practices or considerations would be greatly appreciated.
3 weeks ago
Greetings @pooja_bhumandla
Based on your size distribution, enabling Liquid Clustering can provide meaningful gains, but you'll get the highest ROI by prioritizing your medium and large tables first and selectively applying it to small tables where query patterns warrant it.
Liquid Clustering replaces partitioning and ZORDER and is recommended for new Delta tables to simplify layout decisions and improve data skipping and file pruning for filtered queries.
It is particularly beneficial for tables that are frequently filtered by high-cardinality columns, have skewed distributions, grow quickly, see concurrent writes, or have evolving access patterns.
For very small tables, Automatic Liquid Clustering may choose not to select keys because the table is too small to benefit; LC's gains are workload-dependent and cost-aware.
Large (> 1 TB) – 3 tables: Strong candidates. Enable LC and cluster by the most common filter columns; run an initial OPTIMIZE FULL to recluster existing data and schedule ongoing OPTIMIZE (or enable Predictive Optimization). Expect noticeable read speedups and simpler maintenance compared to partitioning/ZORDER.
Medium (10 GB – 1 TB) – 284 tables: High-ROI segment. LC generally improves file pruning and read latencies and is recommended for new and existing Delta tables. Avoid over-specifying clustering keys on the smaller side of this range (see the key-count guidance below).
Small (< 10 GB) – 17,266 tables: Apply LC selectively. If these tables are mostly broadcast joined, scanned in full, or used as small dimensions, LC offers limited benefit versus simply relying on broadcast joins, table statistics, and caching.
If these small tables are frequently filtered on high-cardinality columns, LC can still help with data skipping, but Automatic LC may skip key selection when tables are very small; weigh OPTIMIZE costs against query savings.
Start with 1–2 keys most frequently used in filters, especially high-cardinality columns (e.g., timestamp, user_id). On smaller tables (under ~10 TB), too many keys can hurt single-column filter performance; cap at two unless multi-column filters dominate.
If you previously partitioned or Z-ordered, reuse those columns as LC keys (partition columns, ZORDER BY columns).
-- New table
CREATE TABLE my_uc_tbl (...) CLUSTER BY AUTO;
-- Existing table
ALTER TABLE my_uc_tbl CLUSTER BY AUTO;
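If you prefer explicit keys over AUTO (for example, reusing your former partition or ZORDER columns as suggested above), a minimal sketch looks like this; the column names (event_ts, user_id) are placeholders for your own filter columns:
-- New table with explicit clustering keys (1-2 high-cardinality filter columns)
CREATE TABLE my_uc_tbl (event_ts TIMESTAMP, user_id BIGINT, payload STRING)
CLUSTER BY (event_ts, user_id);
-- Switch an existing table from AUTO to explicit keys (takes effect for newly written data; OPTIMIZE FULL reclusters the rest)
ALTER TABLE my_uc_tbl CLUSTER BY (event_ts, user_id);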
Run OPTIMIZE regularly; it incrementally reclusters only what's needed. Without Predictive Optimization, schedule OPTIMIZE hourly or every few hours on heavily updated tables; with PO, disable manual schedules and let PO run it.
OPTIMIZE my_tbl; -- incremental reclustering
OPTIMIZE my_tbl FULL; -- force reclustering of all records (on first enable or after a key change)
Run OPTIMIZE FULL once when enabling LC or changing keys; this can take hours but sets a clean baseline.
For streaming writes, eager clustering on write can be enabled with spark.databricks.delta.liquid.eagerClustering.streaming.enabled=true, but thresholds apply and many streaming micro-batches won't trigger it; OPTIMIZE catches up incrementally.
For tiny lookup/dimension tables, keep column statistics current (ANALYZE TABLE ... COMPUTE STATISTICS FOR ALL COLUMNS) and use Delta caching for repeated access; these tend to deliver more benefit than LC.
Avoid partitioning tables under ~1 TB; LC (or ingestion-time clustering + auto-optimize) is a better choice to prevent small-file and skew issues while retaining data skipping.
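As a minimal sketch of the stats-plus-broadcast approach for a small dimension table (table and column names below are hypothetical examples, not from your environment):
-- Collect column-level statistics so the optimizer can size joins and prune accurately
ANALYZE TABLE dim_country COMPUTE STATISTICS FOR ALL COLUMNS;
-- Hint a broadcast join for the small dimension instead of relying on clustering
SELECT /*+ BROADCAST(d) */ f.order_id, d.country_name
FROM fact_sales f
JOIN dim_country d ON f.country_id = d.country_id;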
Use Predictive Optimization on UC managed tables to automate OPTIMIZE/VACUUM and LC maintenance across your estate.
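If you go the PO route, it can be enabled at the schema or catalog level so it covers the whole estate; the names below are placeholders, and availability depends on your region and table type, so treat this as a sketch:
-- Let Predictive Optimization manage OPTIMIZE/VACUUM for all managed tables in a schema
ALTER SCHEMA main.sales ENABLE PREDICTIVE OPTIMIZATION;
-- Or enable it for an entire catalog
ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;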
Prioritize enabling LC (or LC AUTO) on the 3 large and 284 medium tables, beginning with those most queried via filters and those with skew/concurrency; measure query latency and scan bytes before/after.
For the 17k+ small tables, only enable LC where the workload is filter-heavy on high-cardinality columns; otherwise rely on broadcast/stats/caching for the best overall cost-performance, and leave LC to PO where it decides it's worthwhile.
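As part of the before/after measurement, one way to confirm which tables actually have clustering keys in place during the rollout (assuming a recent DBR where DESCRIBE DETAIL reports clustering columns):
-- The clusteringColumns field in the output lists the current Liquid Clustering keys (empty if none)
DESCRIBE DETAIL my_uc_tbl;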
Hope this helps, Louis.