Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Should I enable Liquid Clustering based on table size distribution?

pooja_bhumandla
New Contributor III

Hi everyone,

I’m evaluating whether Liquid Clustering would be beneficial for my tables based on their sizes. Below is the size distribution of tables in my environment:

Size Bucket              Table Count
Large (> 1 TB)           3
Medium (10 GB – 1 TB)    284
Small (< 10 GB)          17,266

Given this distribution, would Liquid Clustering provide significant performance improvements, or is it better suited for environments with more large tables?

Any best practices or considerations would be greatly appreciated.

1 REPLY

Louis_Frolio
Databricks Employee

Greetings @pooja_bhumandla 

Based on your size distribution, enabling Liquid Clustering can provide meaningful gains—but you’ll get the highest ROI by prioritizing your medium and large tables first and selectively applying it to small tables where query patterns warrant it.

What Liquid Clustering is best for

  • Liquid Clustering replaces partitioning and ZORDER and is recommended for new Delta tables to simplify layout decisions and improve data skipping and file pruning for filtered queries (a minimal syntax sketch follows this list).

  • It is particularly beneficial for tables that are frequently filtered by high‑cardinality columns, have skewed distributions, grow quickly, see concurrent writes, or have evolving access patterns.

  • For very small tables, Automatic Liquid Clustering may choose not to select keys because the table is too small to benefit; LC’s gains are workload-dependent, and automatic key selection weighs predicted query savings against clustering cost.
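For illustration, here is a minimal sketch of the layout declaration difference (table and column names are hypothetical):

-- Legacy layout: Hive-style partitioning, often combined with OPTIMIZE ... ZORDER BY
CREATE TABLE events_partitioned (
  event_date DATE,
  user_id    STRING,
  payload    STRING
) PARTITIONED BY (event_date);

-- Liquid Clustering: declare clustering keys instead of partitions
CREATE TABLE events_clustered (
  event_ts TIMESTAMP,
  user_id  STRING,
  payload  STRING
) CLUSTER BY (event_ts, user_id);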

Assessment for your environment

  • Large (>1 TB) — 3 tables: Strong candidates. Enable LC and cluster by the most common filter columns; run an initial OPTIMIZE FULL to recluster existing data and schedule ongoing OPTIMIZE (or enable Predictive Optimization); see the sketch after this list. Expect noticeable read speedups and simpler maintenance compared to partitioning/ZORDER.

  • Medium (10 GB – 1 TB) — 284 tables: High ROI segment. LC generally improves file pruning and read latencies; recommended for new and existing Delta tables. Avoid over‑specifying clustering keys on the smaller side of this range (see key count guidance below).

  • Small (< 10 GB) — 17,266 tables: Apply LC selectively. If the tables are mostly broadcast joined, scanned in full, or used as small dimensions, LC offers limited benefit versus simply relying on broadcast joins, table stats, and caching.

    If these small tables are frequently filtered on high‑cardinality columns, LC can still help with data skipping, but Automatic LC may skip keys if they’re very small; weigh OPTIMIZE costs vs. query savings.
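For the large and medium tables above, a minimal enablement sketch with explicit keys could look like the following (table and column names are hypothetical; pick the columns your queries actually filter on):

-- Enable LC on an existing table with explicit keys
ALTER TABLE sales_fact CLUSTER BY (order_date, customer_id);

-- One-time full recluster after first enablement or a key change
OPTIMIZE sales_fact FULL;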

Expected impact

  • Customers have observed read performance improvements of 2–12x versus traditional partitioning/ZORDER methods, depending on workload and key selection.

Best practices and rollout plan

1) Choose the right clustering keys

  • Start with 1–2 keys most frequently used in filters, especially high‑cardinality columns (e.g., timestamp, user_id). On smaller tables (under ~10 TB), too many keys can hurt single‑column filter performance; cap at two unless multi‑column filters dominate.

  • If you previously partitioned or Z‑ordered, reuse those columns as LC keys (partition columns, ZORDER BY columns).
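If you are unsure which columns a table was previously partitioned or Z‑ordered by, the standard Delta commands can surface them (table name hypothetical):

-- partitionColumns in the output shows existing partition keys
DESCRIBE DETAIL my_tbl;

-- operationParameters of past OPTIMIZE entries shows any zOrderBy columns
DESCRIBE HISTORY my_tbl;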

2) Let Databricks pick keys where possible

  • For Unity Catalog managed tables, enable CLUSTER BY AUTO (requires Predictive Optimization) so Databricks analyzes query history and selects/adapts keys automatically when the predicted savings outweigh clustering costs.
     
-- New table
CREATE TABLE my_uc_tbl (...) CLUSTER BY AUTO;

-- Existing table
ALTER TABLE my_uc_tbl CLUSTER BY AUTO;
 

3) Trigger clustering and keep it incremental

  • Use OPTIMIZE regularly; it incrementally reclusters only what’s needed. Without Predictive Optimization, schedule OPTIMIZE hourly or every few hours on heavily updated tables; with PO, disable manual schedules and let PO run it.
     
OPTIMIZE my_tbl;            -- incremental reclustering
OPTIMIZE my_tbl FULL;       -- force reclustering of all records (first enable or key change)
  • For large tables, run OPTIMIZE FULL once when enabling LC or changing keys; this can take hours but sets a clean baseline.

4) Understand clustering-on-write thresholds

  • Clustering on write triggers only when an individual transaction meets size thresholds; if your inserts are small, rely on periodic OPTIMIZE to finish clustering. Thresholds (UC managed tables): ~64 MB for 1 key, 256 MB for 2 keys, 512 MB for 3 keys, 1 GB for 4 keys.
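As a rough sketch of how this plays out (hypothetical table names; whether clustering on write fires depends on the transaction meeting the thresholds above):

-- A single large backfill transaction may cluster on write if it clears the threshold
INSERT INTO sales_fact SELECT * FROM staging_sales;

-- Small trickle inserts won't trigger clustering on write; periodic OPTIMIZE catches up
OPTIMIZE sales_fact;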

5) Streaming and ingestion considerations

  • Structured Streaming can cluster on write by enabling spark.databricks.delta.liquid.eagerClustering.streaming.enabled=true, but thresholds apply and many streaming micro-batches won’t trigger; OPTIMIZE catches up incrementally.
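If you want to experiment with that flag, it can be set at the session level; this is only a sketch using the config name from the bullet above, and the usual per-transaction size thresholds still apply:

-- Session-level setting; clustering on write still requires the micro-batch to meet size thresholds
SET spark.databricks.delta.liquid.eagerClustering.streaming.enabled = true;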

6) Compatibility and runtime

  • LC requires upgraded Delta protocols (writer v7, reader v3) and deletion vectors; you cannot downgrade table protocol versions. LC is GA for Delta Lake in DBR 15.2+, and Databricks recommends recent runtimes for best performance (e.g., DBR 17.2+ improves OPTIMIZE performance on large tables).
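Before enabling LC on an existing table, you can check its current protocol versions and table features (table name hypothetical):

-- Look for delta.minReaderVersion, delta.minWriterVersion, and delta.feature.* entries
SHOW TBLPROPERTIES my_tbl;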

7) Alternatives for small tables

  • Prefer broadcast joins, keep table statistics up to date (ANALYZE TABLE ... COMPUTE STATISTICS FOR ALL COLUMNS), and use Delta caching for repeated access; these tend to deliver more benefit than LC for tiny lookup/dimension tables.
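A quick sketch of those alternatives for a small dimension table (table, column, and alias names are hypothetical):

-- Keep statistics fresh so the optimizer favors broadcast joins for small tables
ANALYZE TABLE dim_country COMPUTE STATISTICS FOR ALL COLUMNS;

-- Or force a broadcast explicitly when you know the dimension is tiny
SELECT /*+ BROADCAST(d) */ f.*, d.country_name
FROM sales_fact f
JOIN dim_country d ON f.country_code = d.country_code;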

8) General table layout guidance

  • Avoid partitioning tables under ~1 TB; LC (or ingestion‑time clustering + auto‑optimize) is a better choice to prevent small‑file and skew issues while retaining data skipping.

  • Use Predictive Optimization on UC managed tables to automate OPTIMIZE/VACUUM and LC maintenance across your estate.
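Assuming Predictive Optimization is available on your account, it can be enabled at the catalog, schema, or table level; a sketch with hypothetical names:

-- Enable PO for everything in a schema, or for a single table
ALTER SCHEMA analytics ENABLE PREDICTIVE OPTIMIZATION;
ALTER TABLE analytics.sales_fact ENABLE PREDICTIVE OPTIMIZATION;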

Recommended approach for your mix

  • Prioritize enabling LC (or LC AUTO) on the 3 large and 284 medium tables, beginning with those most queried via filters and those with skew/concurrency; measure query latency and scan bytes before/after.

  • For the 17k+ small tables, only enable LC where the workload is filter‑heavy on high‑cardinality columns; otherwise rely on broadcast/stats/caching for the best overall cost‑performance, and leave LC decisions to Predictive Optimization where it judges the benefit worthwhile.
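As you roll this out, DESCRIBE DETAIL is a simple way to confirm which tables ended up with clustering keys and how large they are (table name hypothetical):

-- clusteringColumns and sizeInBytes in the output help track the rollout
DESCRIBE DETAIL my_tbl;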

    Hope this helps, Louis.