3 weeks ago
Hi everyone,
I’m evaluating whether Liquid Clustering would be beneficial for my tables based on their sizes. Below is the size distribution of tables in my environment:
| Table size | Count |
| --- | --- |
| Large (> 1 TB) | 3 |
| Medium (10 GB – 1 TB) | 284 |
| Small (< 10 GB) | 17,266 |
Given this distribution, would Liquid Clustering provide significant performance improvements, or is it better suited for environments with more large tables?
Any best practices or considerations would be greatly appreciated.
3 weeks ago
Greetings @pooja_bhumandla
Based on your size distribution, enabling Liquid Clustering can provide meaningful gains—but you’ll get the highest ROI by prioritizing your medium and large tables first and selectively applying it to small tables where query patterns warrant it.
Liquid Clustering replaces partitioning and ZORDER and is recommended for new Delta tables to simplify layout decisions and improve data skipping and file pruning for filtered queries.
It is particularly beneficial for tables that are frequently filtered by high‑cardinality columns, have skewed distributions, grow quickly, see concurrent writes, or have evolving access patterns.
For very small tables, Automatic Liquid Clustering may choose not to select keys because the table is too small to benefit; LC’s gains are workload-dependent and cost-aware.
Large (>1 TB) — 3 tables: Strong candidates. Enable LC and cluster by the most common filter columns; run an initial OPTIMIZE FULL to recluster existing data and schedule ongoing OPTIMIZE (or enable Predictive Optimization). Expect noticeable read speedups and simpler maintenance compared to partitioning/ZORDER.
Medium (10 GB – 1 TB) — 284 tables: High‑ROI segment. LC generally improves file pruning and read latencies and is recommended for both new and existing Delta tables. Avoid over‑specifying clustering keys on the smaller side of this range (see the key‑count guidance below).
Small (< 10 GB) — 17,266 tables: Apply LC selectively. If the tables are mostly broadcast joined, scanned in full, or used as small dimensions, LC offers limited benefit versus simply relying on broadcast joins, table stats, and caching.
If these small tables are frequently filtered on high‑cardinality columns, LC can still help with data skipping, but Automatic LC may decline to pick keys when a table is very small; weigh the OPTIMIZE cost against the query savings.
Start with 1–2 clustering keys, choosing the columns most frequently used in filters, especially high‑cardinality ones (e.g., timestamp, user_id). On tables under ~10 TB, too many keys can hurt single‑column filter performance; cap it at two unless multi‑column filters dominate.
If you previously partitioned or Z‑ordered, reuse those columns as LC keys (partition columns, ZORDER BY columns).
-- New table (Automatic Liquid Clustering: Databricks chooses and evolves the keys)
CREATE TABLE my_uc_tbl (...) CLUSTER BY AUTO;
-- Existing table
ALTER TABLE my_uc_tbl CLUSTER BY AUTO;
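If you prefer explicit keys instead of AUTO, for example reusing the former partition or ZORDER columns mentioned above, here is a minimal sketch (table and column names are hypothetical, not from your environment):
-- Hypothetical fact table formerly partitioned by event_date and Z-ordered by user_id
CREATE TABLE sales_fact (...) CLUSTER BY (event_date, user_id);
-- Or switch an existing table to explicit clustering keys
ALTER TABLE sales_fact CLUSTER BY (event_date, user_id);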
Run OPTIMIZE regularly; it incrementally reclusters only what’s needed. Without Predictive Optimization (PO), schedule OPTIMIZE hourly or every few hours on heavily updated tables; with PO, disable manual schedules and let PO run it.
OPTIMIZE my_tbl; -- incremental reclustering
OPTIMIZE my_tbl FULL; -- force reclustering of all records (on first enable or after a key change)
Run OPTIMIZE FULL once when enabling LC or changing keys; this can take hours but sets a clean baseline.
For streaming writes, eager clustering on write can be enabled with spark.databricks.delta.liquid.eagerClustering.streaming.enabled=true, but size thresholds apply and many streaming micro-batches won’t trigger it; OPTIMIZE catches up incrementally.
For the tiny lookup/dimension tables, keep statistics fresh (ANALYZE TABLE ... COMPUTE STATISTICS FOR ALL COLUMNS) and use Delta caching for repeated access; these tend to deliver more benefit than LC.
Avoid partitioning tables under ~1 TB; LC (or ingestion‑time clustering + auto‑optimize) is a better choice to prevent small‑file and skew issues while retaining data skipping.
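A sketch of the small‑table maintenance steps above, assuming a hypothetical dimension table dim_region (the SET statement is a session‑level toggle for the config named above):
-- Refresh column statistics so the optimizer can plan broadcast joins and pruning
ANALYZE TABLE dim_region COMPUTE STATISTICS FOR ALL COLUMNS;
-- Optionally enable eager clustering for streaming writes in the current session
SET spark.databricks.delta.liquid.eagerClustering.streaming.enabled = true;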
Use Predictive Optimization on UC managed tables to automate OPTIMIZE/VACUUM and LC maintenance across your estate.
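If PO isn’t already inherited from the account‑level setting, it can be turned on per catalog or schema; a sketch with a hypothetical schema name:
-- Enable Predictive Optimization for all managed tables in a UC schema
ALTER SCHEMA main.analytics ENABLE PREDICTIVE OPTIMIZATION;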
Prioritize enabling LC (or LC AUTO) on the 3 large and 284 medium tables, beginning with those most queried via filters and those with skew/concurrency; measure query latency and scan bytes before/after.
For the 17k+ small tables, only enable LC where the workload is filter‑heavy on high‑cardinality columns; otherwise rely on broadcast/stats/caching for best overall cost‑performance and leave LC to PO where it decides it’s worthwhile.
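To confirm which tables actually picked up clustering keys (and to compare file counts before and after OPTIMIZE), DESCRIBE DETAIL is handy; a sketch using the example table name from earlier:
-- Returns clusteringColumns, numFiles, and sizeInBytes for Delta tables
DESCRIBE DETAIL my_uc_tbl;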
Hope this helps, Louis.