Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Should I enable Liquid Clustering based on table size distribution?

pooja_bhumandla
New Contributor III

Hi everyone,

I’m evaluating whether Liquid Clustering would be beneficial for my tables based on their sizes. Below is the size distribution of tables in my environment:

Size Bucket              Table Count
Large (> 1 TB)           3
Medium (10 GB – 1 TB)    284
Small (< 10 GB)          17,266

Given this distribution, would Liquid Clustering provide significant performance improvements, or is it better suited for environments with more large tables?

Any best practices or considerations would be greatly appreciated.

1 REPLY

Louis_Frolio
Databricks Employee

Greetings @pooja_bhumandla 

Based on your size distribution, enabling Liquid Clustering can provide meaningful gains—but you’ll get the highest ROI by prioritizing your medium and large tables first and selectively applying it to small tables where query patterns warrant it.

What Liquid Clustering is best for

  • Liquid Clustering replaces partitioning and ZORDER and is recommended for new Delta tables to simplify layout decisions and improve data skipping and file pruning for filtered queries (a minimal syntax sketch follows this list).

  • It is particularly beneficial for tables that are frequently filtered by high‑cardinality columns, have skewed distributions, grow quickly, see concurrent writes, or have evolving access patterns.

  • For very small tables, Automatic Liquid Clustering may choose not to select keys because the table is too small to benefit; LC’s gains are workload-dependent, and automatic key selection weighs predicted query savings against clustering cost.
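For illustration, here is a minimal sketch of the layout declaration difference (table and column names are hypothetical):

-- Legacy layout: Hive-style partitioning, often combined with OPTIMIZE ... ZORDER BY
CREATE TABLE events_partitioned (
  event_date DATE,
  user_id    STRING,
  payload    STRING
) PARTITIONED BY (event_date);

-- Liquid Clustering: declare clustering keys instead of partitions
CREATE TABLE events_clustered (
  event_ts TIMESTAMP,
  user_id  STRING,
  payload  STRING
) CLUSTER BY (event_ts, user_id);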

Assessment for your environment

  • Large (>1 TB) — 3 tables: Strong candidates. Enable LC and cluster by the most common filter columns; run an initial OPTIMIZE FULL to recluster existing data and schedule ongoing OPTIMIZE (or enable Predictive Optimization); see the sketch after this list. Expect noticeable read speedups and simpler maintenance compared to partitioning/ZORDER.

  • Medium (10 GB – 1 TB) — 284 tables: High ROI segment. LC generally improves file pruning and read latencies; recommended for new and existing Delta tables. Avoid over‑specifying clustering keys on the smaller side of this range (see key count guidance below).

  • Small (< 10 GB) — 17,266 tables: Apply LC selectively. If the tables are mostly broadcast joined, scanned in full, or used as small dimensions, LC offers limited benefit versus simply relying on broadcast joins, table stats, and caching.

    If these small tables are frequently filtered on high‑cardinality columns, LC can still help with data skipping, but Automatic LC may skip keys if they’re very small; weigh OPTIMIZE costs vs. query savings.
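For the large and medium tables above, a minimal enablement sketch with explicit keys could look like the following (table and column names are hypothetical; pick the columns your queries actually filter on):

-- Enable LC on an existing table with explicit keys
ALTER TABLE sales_fact CLUSTER BY (order_date, customer_id);

-- One-time full recluster after first enablement or a key change
OPTIMIZE sales_fact FULL;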

Expected impact

  • Customers have observed read performance improvements of 2–12x versus traditional partitioning/ZORDER methods, depending on workload and key selection.

Best practices and rollout plan

1) Choose the right clustering keys

  • Start with 1–2 keys most frequently used in filters, especially high‑cardinality columns (e.g., timestamp, user_id). On smaller tables (under ~10 TB), too many keys can hurt single‑column filter performance; cap at two unless multi‑column filters dominate.

  • If you previously partitioned or Z‑ordered, reuse those columns as LC keys (partition columns, ZORDER BY columns).
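If you are unsure which columns a table was previously partitioned or Z‑ordered by, the standard Delta commands can surface them (table name hypothetical):

-- partitionColumns in the output shows existing partition keys
DESCRIBE DETAIL my_tbl;

-- operationParameters of past OPTIMIZE entries shows any zOrderBy columns
DESCRIBE HISTORY my_tbl;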

2) Let Databricks pick keys where possible

  • For Unity Catalog managed tables, enable CLUSTER BY AUTO (requires Predictive Optimization) so Databricks analyzes query history and selects/adapts keys automatically when the predicted savings outweigh clustering costs.
     
-- New table
CREATE TABLE my_uc_tbl (...) CLUSTER BY AUTO;

-- Existing table
ALTER TABLE my_uc_tbl CLUSTER BY AUTO;
 

3) Trigger clustering and keep it incremental

  • Use OPTIMIZE regularly; it incrementally reclusters only what’s needed. Without Predictive Optimization, schedule OPTIMIZE hourly or every few hours on heavily updated tables; with PO, disable manual schedules and let PO run it.
     
OPTIMIZE my_tbl;            -- incremental reclustering
OPTIMIZE my_tbl FULL;       -- force reclustering of all records (first enable or key change)
  • For large tables, run OPTIMIZE FULL once when enabling LC or changing keys; this can take hours but sets a clean baseline.

4) Understand clustering-on-write thresholds

  • Clustering on write triggers only when an individual transaction meets size thresholds; if your inserts are small, rely on periodic OPTIMIZE to finish clustering. Thresholds (UC managed tables): ~64 MB for 1 key, 256 MB for 2 keys, 512 MB for 3 keys, 1 GB for 4 keys.
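As a rough sketch of how this plays out (hypothetical table names; whether clustering on write fires depends on the transaction meeting the thresholds above):

-- A single large backfill transaction may cluster on write if it clears the threshold
INSERT INTO sales_fact SELECT * FROM staging_sales;

-- Small trickle inserts won't trigger clustering on write; periodic OPTIMIZE catches up
OPTIMIZE sales_fact;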

5) Streaming and ingestion considerations

  • Structured Streaming can cluster on write by enabling spark.databricks.delta.liquid.eagerClustering.streaming.enabled=true, but thresholds apply and many streaming micro-batches won’t trigger; OPTIMIZE catches up incrementally.
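If you want to experiment with that flag, it can be set at the session level; this is only a sketch using the config name from the bullet above, and the usual per-transaction size thresholds still apply:

-- Session-level setting; clustering on write still requires the micro-batch to meet size thresholds
SET spark.databricks.delta.liquid.eagerClustering.streaming.enabled = true;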

6) Compatibility and runtime

  • LC requires upgraded Delta protocols (writer v7, reader v3) and deletion vectors; you cannot downgrade table protocol versions. LC is GA for Delta Lake in DBR 15.2+, and Databricks recommends recent runtimes for best performance (e.g., DBR 17.2+ improves OPTIMIZE performance on large tables).
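Before enabling LC on an existing table, you can check its current protocol versions and table features (table name hypothetical):

-- Look for delta.minReaderVersion, delta.minWriterVersion, and delta.feature.* entries
SHOW TBLPROPERTIES my_tbl;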

7) Alternatives for small tables

  • Prefer broadcast joins, keep table statistics up to date (ANALYZE TABLE ... COMPUTE STATISTICS FOR ALL COLUMNS), and use Delta caching for repeated access; these tend to deliver more benefit than LC for tiny lookup/dimension tables.
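A quick sketch of those alternatives for a small dimension table (table, column, and alias names are hypothetical):

-- Keep statistics fresh so the optimizer favors broadcast joins for small tables
ANALYZE TABLE dim_country COMPUTE STATISTICS FOR ALL COLUMNS;

-- Or force a broadcast explicitly when you know the dimension is tiny
SELECT /*+ BROADCAST(d) */ f.*, d.country_name
FROM sales_fact f
JOIN dim_country d ON f.country_code = d.country_code;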

8) General table layout guidance

  • Avoid partitioning tables under ~1 TB; LC (or ingestion‑time clustering + auto‑optimize) is a better choice to prevent small‑file and skew issues while retaining data skipping.

  • Use Predictive Optimization on UC managed tables to automate OPTIMIZE/VACUUM and LC maintenance across your estate.
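Assuming Predictive Optimization is available on your account, it can be enabled at the catalog, schema, or table level; a sketch with hypothetical names:

-- Enable PO for everything in a schema, or for a single table
ALTER SCHEMA analytics ENABLE PREDICTIVE OPTIMIZATION;
ALTER TABLE analytics.sales_fact ENABLE PREDICTIVE OPTIMIZATION;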

Recommended approach for your mix

  • Prioritize enabling LC (or LC AUTO) on the 3 large and 284 medium tables, beginning with those most queried via filters and those with skew/concurrency; measure query latency and scan bytes before/after.

  • For the 17k+ small tables, only enable LC where the workload is filter‑heavy on high‑cardinality columns; otherwise rely on broadcast/stats/caching for the best overall cost‑performance, and leave LC decisions to Predictive Optimization where it judges the benefit worthwhile.
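As you roll this out, DESCRIBE DETAIL is a simple way to confirm which tables ended up with clustering keys and how large they are (table name hypothetical):

-- clusteringColumns and sizeInBytes in the output help track the rollout
DESCRIBE DETAIL my_tbl;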

    Hope this helps, Louis.