3 weeks ago
Hi everyone,
I'm evaluating whether Liquid Clustering would be beneficial for my tables based on their sizes. Below is the size distribution of tables in my environment:
| Size tier | Table count |
| --- | --- |
| Large (> 1 TB) | 3 |
| Medium (10 GB – 1 TB) | 284 |
| Small (< 10 GB) | 17,266 |
Given this distribution, would Liquid Clustering provide significant performance improvements, or is it better suited for environments with more large tables?
Any best practices or considerations would be greatly appreciated.
3 weeks ago
Greetings @pooja_bhumandla
Based on your size distribution, enabling Liquid Clustering can provide meaningful gains, but you'll get the highest ROI by prioritizing your medium and large tables first and selectively applying it to small tables where query patterns warrant it.
Liquid Clustering replaces partitioning and ZORDER and is recommended for new Delta tables to simplify layout decisions and improve data skipping and file pruning for filtered queries.
It is particularly beneficial for tables that are frequently filtered by high-cardinality columns, have skewed distributions, grow quickly, see concurrent writes, or have evolving access patterns.
For very small tables, Automatic Liquid Clustering may choose not to select keys because the table is too small to benefit; its key selection is cost-aware, and LC's gains are workload-dependent.
Large (> 1 TB), 3 tables: Strong candidates. Enable LC and cluster by the most common filter columns; run an initial OPTIMIZE FULL to recluster existing data and schedule ongoing OPTIMIZE (or enable Predictive Optimization). Expect noticeable read speedups and simpler maintenance compared to partitioning/ZORDER.
Medium (10 GB to 1 TB), 284 tables: High-ROI segment. LC generally improves file pruning and read latencies and is recommended for both new and existing Delta tables. Avoid over-specifying clustering keys on the smaller side of this range (see the key-count guidance below).
Small (< 10 GB), 17,266 tables: Apply LC selectively. If the tables are mostly broadcast joined, scanned in full, or used as small dimensions, LC offers limited benefit versus simply relying on broadcast joins, table stats, and caching.
If these small tables are frequently filtered on high-cardinality columns, LC can still help with data skipping, but Automatic LC may skip key selection when a table is very small; weigh OPTIMIZE costs against query savings.
Start with 1–2 keys that are most frequently used in filters, especially high-cardinality columns (e.g., timestamp, user_id). On tables under ~10 TB, too many keys can hurt single-column filter performance; cap at two unless multi-column filters dominate.
If you previously partitioned or Z-ordered the table, reuse those columns as LC keys (former partition columns and ZORDER BY columns); an explicit-key sketch follows the AUTO snippet below.
-- New table
CREATE TABLE my_uc_tbl (...) CLUSTER BY AUTO;
-- Existing table
ALTER TABLE my_uc_tbl CLUSTER BY AUTO;
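If you prefer to pin the keys explicitly rather than use AUTO (for example, reusing former partition or ZORDER columns as suggested above), here is a minimal sketch; the table and column names (events, event_date, user_id) are hypothetical placeholders for your most-filtered columns.
-- New table with explicit clustering keys (hypothetical names)
CREATE TABLE events (event_date DATE, user_id BIGINT, payload STRING)
CLUSTER BY (event_date, user_id);
-- Existing table: set or change keys, then recluster once
ALTER TABLE events CLUSTER BY (event_date, user_id);
OPTIMIZE events FULL;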
OPTIMIZE regularly; it incrementally reclusters only what's needed. Without Predictive Optimization, schedule OPTIMIZE hourly or every few hours on highly updated tables; with PO, disable manual schedules and let PO run it.
OPTIMIZE my_tbl; -- incremental reclustering
OPTIMIZE my_tbl FULL; -- force reclustering of all records (run when first enabling LC or after changing keys)
Run OPTIMIZE FULL once when enabling LC or changing keys; this can take hours on big tables but sets a clean baseline.
For streaming writes, eager clustering on write can be enabled with spark.databricks.delta.liquid.eagerClustering.streaming.enabled=true, but size thresholds apply and many streaming micro-batches won't trigger it; scheduled OPTIMIZE catches up incrementally.
For the small tables, keep column statistics fresh (ANALYZE TABLE ... COMPUTE STATISTICS FOR ALL COLUMNS) and use Delta caching for repeated access; these tend to deliver more benefit than LC for tiny lookup/dimension tables.
Avoid partitioning tables under ~1 TB; LC (or ingestion-time clustering + auto-optimize) is a better choice to prevent small-file and skew issues while retaining data skipping.
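A minimal sketch of those two small-table/streaming maintenance commands, assuming the config above is set at the session level and using a hypothetical dimension table named dim_country:
-- Opt streaming writes into eager clustering for this session (thresholds still apply)
SET spark.databricks.delta.liquid.eagerClustering.streaming.enabled = true;
-- Refresh column statistics on a small lookup/dimension table
ANALYZE TABLE dim_country COMPUTE STATISTICS FOR ALL COLUMNS;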
Use Predictive Optimization on UC managed tables to automate OPTIMIZE/VACUUM and LC maintenance across your estate.
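A hedged sketch of enabling Predictive Optimization at the schema level; the catalog and schema names are hypothetical, and PO applies only to Unity Catalog managed tables:
-- Enable PO for all managed tables in a schema (tables inherit unless overridden)
ALTER SCHEMA main.analytics ENABLE PREDICTIVE OPTIMIZATION;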
Prioritize enabling LC (or LC AUTO) on the 3 large and 284 medium tables, beginning with those most queried via filters and those with skew/concurrency; measure query latency and scan bytes before/after.
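One lightweight way to baseline file layout before and after enabling LC (a sketch; the table name is hypothetical, and per-query scan bytes are easiest to compare from the query profile):
-- Capture file count, size, and clustering keys before/after OPTIMIZE
DESCRIBE DETAIL my_large_tbl; -- returns numFiles, sizeInBytes, clusteringColumns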
For the 17k+ small tables, only enable LC where the workload is filter-heavy on high-cardinality columns; otherwise rely on broadcast/stats/caching for best overall cost-performance and leave LC to PO where it decides it's worthwhile.
Hope this helps, Louis.