Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

When to Use and when Not to Use Liquid Clustering?

pooja_bhumandla
New Contributor III

 

Hi everyone,

I’m looking for some practical guidance and experiences around when to choose Liquid Clustering versus sticking with traditional partitioning + Z-ordering.

From what I’ve gathered so far:

For small tables (<10TB), Liquid Clustering gives similar performance to traditional approaches if queries consistently filter on 1–2 columns.

For lookups on more than two columns, partitioning with Z-ordering might offer better and more predictable read performance.

The number of files and the number of clustering columns also seem to impact efficiency — too many clustering keys (e.g., 4+) may hurt performance for single-column lookups.

But I’d love to hear from others:

  • How do you decide when Liquid Clustering is worth it?
  • Have you seen clear performance gains (or drawbacks) based on table size, number of clustering columns, or file count?
  • Any best practices or gotchas from your real-world implementations?

Appreciate any insights, benchmarks, or rules of thumb the community can share!

 

2 REPLIES

Louis_Frolio
Databricks Employee

Greetings @pooja_bhumandla,

 

Thanks for laying out your current understanding — here’s practical guidance, trade-offs, and field rules-of-thumb for choosing between Liquid Clustering and traditional partitioning + Z-ordering, along with gotchas to watch for.
 

Quick rules of thumb

  • Prefer Liquid Clustering for new Delta tables and use clustering keys on the columns most frequently used in filters or joins; it is incompatible with partitioning and ZORDER, and Databricks manages the layout/optimization once enabled.
  • Limit clustering keys to the best 1–4 columns; clustering on too many unrelated keys dilutes data skipping benefits and predictability for single-key lookups.
  • Don’t partition tables under ~1TB; if you must optimize small tables, use Liquid Clustering (or Z-ordering if LC isn’t an option). If you do partition, ensure at least ~1GB per partition to avoid under-partitioning.
  • Run OPTIMIZE regularly on LC tables to incrementally cluster new data; for full-table reclustering after enabling LC on existing data, use OPTIMIZE FULL. If you’re on Unity Catalog, consider Predictive Optimization and CLUSTER BY AUTO to let Databricks pick and evolve keys for you.
  • Data skipping is your baseline accelerator: stats are collected by default on the first 32 columns; if you filter on more than 32 columns, increase the indexed columns with dataSkippingNumIndexedCols (trade-off: higher write overhead).
  • Metadata-only queries on partition columns can be faster with partitioning; LC tables need to scan data for such metadata-style aggregations (for example, distinct partition values), so consider this if your workload relies heavily on partition metadata.
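As a concrete starting point, here is a minimal sketch of the setup commands the rules above refer to (table and column names are hypothetical):

```sql
-- Create a new Delta table with Liquid Clustering on the most common filter columns
CREATE TABLE sales.events (
  event_date DATE,
  user_id    STRING,
  region     STRING,
  amount     DECIMAL(10, 2)
)
CLUSTER BY (user_id, event_date);

-- Or, on Unity Catalog, let Databricks pick and evolve the keys
CREATE TABLE sales.events_auto (
  event_date DATE,
  user_id    STRING,
  amount     DECIMAL(10, 2)
)
CLUSTER BY AUTO;

-- Incrementally cluster newly written data
OPTIMIZE sales.events;
```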

When Liquid Clustering wins

  • High-cardinality filter columns: LC shines because clustering keys can be chosen purely on access patterns (no cardinality constraints), improving file pruning for point and range queries.
  • Evolving query patterns: You can change LC keys without rewriting existing data; with CLUSTER BY AUTO, Databricks analyzes historical workload and switches keys when predicted savings outweigh re-clustering costs.
  • Heavy concurrent writes/updates: LC’s incremental clustering and row-level concurrency support reduce conflicts and maintenance costs compared to ZORDER, which is more expensive and write‑amplifying.
  • Skewed distributions or rapidly growing tables: LC mitigates skew, avoids over/under-partitioning, and keeps file sizes consistent as data grows.
  • Multi-dimensional filtering across a few columns: LC’s multi-dimensional clustering improves data skipping across those keys; keep key count focused (1–4), and consider hierarchical clustering if you filter one key much more than others.
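The "evolving query patterns" point is where LC differs most from partitioning in practice: keys can be changed after the fact. A hypothetical example (same illustrative table as above):

```sql
-- Change clustering keys; existing files are not rewritten immediately,
-- and subsequent OPTIMIZE runs recluster data toward the new keys
ALTER TABLE sales.events CLUSTER BY (region, user_id);

-- Hand key selection over to Predictive Optimization (Unity Catalog)
ALTER TABLE sales.events CLUSTER BY AUTO;

-- Or turn clustering off entirely
ALTER TABLE sales.events CLUSTER BY NONE;
```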

When partitioning + Z-ordering may still be preferable

  • If you rely on metadata-only partition operations (for example, pure partition metadata queries or file-list-based operations), partitioning can be faster because LC still scans data files.
  • External-engine constraints: LC requires newer Delta reader protocol features and Databricks-managed optimization; if non-Databricks engines lack compatibility, you may prefer partitioning/ZORDER for interoperability.
  • Established, stable partition strategy already delivering SLAs and you specifically need partition-folder isolation; LC is a better default going forward, but you can defer migration until there’s a clear benefit or a schema/access-pattern change.
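For comparison, the traditional layout discussed in this section looks like the following sketch (hypothetical names; note that a table cannot combine this with Liquid Clustering):

```sql
-- Classic approach: partition on a low-cardinality column, Z-order within partitions
CREATE TABLE sales.events_partitioned (
  event_date DATE,
  user_id    STRING,
  device_id  STRING,
  amount     DECIMAL(10, 2)
)
PARTITIONED BY (event_date);

-- Z-order recent partitions on high-cardinality lookup columns
OPTIMIZE sales.events_partitioned
WHERE event_date >= current_date() - INTERVAL 7 DAYS
ZORDER BY (user_id, device_id);
```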

Table size, number of keys, and file count: what we’ve seen

  • Table size: Databricks recommends not partitioning tables under ~1TB and suggests LC for new Delta tables. LC scales well for large tables; Auto Liquid can pick keys and adapt them over time for UC-managed tables.
  • Number of clustering keys: Aim for 1–4, chosen from the most common filters and joins; more keys can reduce clustering focus and predictability of pruning for single-key lookups. Hierarchical clustering (preview) helps when one key dominates query filters.
  • File size / file count: OPTIMIZE will bin-pack files; default optimized file target ~1GB can be adjusted, and practical guidance is to keep file sizes roughly tens to hundreds of MB up to ~1GB (workload-dependent). The Comprehensive Guide’s general file-size guidance of ~16MB–1GB is a good starting envelope; LC and Predictive Optimization help maintain sizes as data evolves.
  • Data skipping coverage: If queries often filter on columns beyond the first 32, increase the skip-stat column count (dataSkippingNumIndexedCols) with care as it raises DML costs; also try to keep frequently filtered columns among the first 32 in the schema.
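The file-size and skipping-stats knobs mentioned above are Delta table properties; a sketch, with illustrative values rather than recommendations:

```sql
ALTER TABLE sales.events SET TBLPROPERTIES (
  -- Collect data-skipping stats beyond the default first 32 columns
  -- (raises write/DML cost, so increase only if you filter past column 32)
  'delta.dataSkippingNumIndexedCols' = '40',
  -- Target file size for OPTIMIZE bin-packing (default target is ~1GB)
  'delta.targetFileSize' = '256mb'
);
```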

Performance observations and benchmarks

  • LC vs partitioned+ZORDER: Field content and internal decks consistently show LC delivering faster writes and similar reads versus well-tuned partitioned tables, while reducing tuning complexity and write amplification compared to ZORDER.
  • Automatic Liquid Clustering results: Preview customers reported strong gains; public blogs cite up to 10x faster queries in some workloads after enabling AUTO LC on gold tables, with lower operational overhead and costs. Real-world mileage varies by query mix and data distribution, but AUTO helps de-risk key selection over time.
Best practices and gotchas

  • Incompatibility: LC cannot be used with Hive partitioning or ZORDER; enabling LC requires Databricks to manage layout/optimization for that table.
  • Initial reclustering: LC applies incrementally; to recluster previously written data, run OPTIMIZE FULL after enabling LC, otherwise new writes are clustered as they arrive.
  • Metadata-style queries: Expect slower metadata-only queries on LC tables compared to partitioned tables because LC scans data rather than leveraging partition folder metadata.
  • Runtime requirements: For Delta LC, use DBR 15.2+; AUTO LC key selection requires DBR 15.4+; managed Iceberg LC requires DBR 16.4+.
  • Operational cadence: If not using Predictive Optimization, schedule periodic OPTIMIZE and VACUUM as appropriate; on UC, PO can do this automatically.
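Pulling the reclustering and maintenance gotchas together, a hypothetical cadence when enabling LC on an existing table:

```sql
-- Enable LC on an existing Delta table (new writes are clustered from here on)
ALTER TABLE sales.events CLUSTER BY (user_id, event_date);

-- One-time full recluster of previously written data
OPTIMIZE sales.events FULL;

-- Periodic maintenance if Predictive Optimization isn't handling it
OPTIMIZE sales.events;   -- incremental clustering of new data
VACUUM sales.events;     -- remove files outside the retention window
```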

Decision workflow (pragmatic)

  • Identify your top filter/join columns and choose 1–3 (up to 4) as LC keys; if on UC, consider CLUSTER BY AUTO to let Predictive Optimization pick and evolve keys.
  • Enable LC (or AUTO LC), then run OPTIMIZE; benchmark representative queries by tracking bytes scanned, file pruning, and wall-clock times before/after. If your workload includes metadata-only partition operations, weigh those specific queries’ SLA impacts before migrating.
  • If you already partition a very large, stable table and rely on partition metadata, you may keep that design — otherwise LC is the recommended path forward for new tables and evolving workloads.
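For the benchmarking step, file counts and sizes can be checked before and after without a full query harness (hypothetical table name; pruning and bytes-scanned figures come from the query profile):

```sql
-- Inspect file count, total size, and clustering columns for the table
DESCRIBE DETAIL sales.events;

-- Compare representative queries before/after; for example, a typical point lookup
SELECT count(*), sum(amount)
FROM sales.events
WHERE user_id = 'u-123' AND event_date = DATE '2024-06-01';
```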

 

Hope this helps, Louis.

mark_ott
Databricks Employee

Deciding between Liquid Clustering and traditional partitioning with Z-ordering depends on table size, query patterns, number of clustering columns, and file optimization needs. For tables under 10TB with queries consistently filtered on 1–2 columns, both approaches offer similar performance, but partitioning may excel in single-column lookups and predictable filtering. For broader, multi-column queries or evolving workloads, Liquid Clustering tends to be more versatile and forgiving, especially as table sizes grow.

When Liquid Clustering Is Worthwhile

  • Highly dynamic or unpredictable query patterns, especially involving 3+ columns.

  • Large tables (10TB+) where partitioning and manual Z-ordering become harder to maintain efficiently.

  • Operational flexibility: You can adjust clustering columns later, unlike partitioning, which is fixed at table creation and costly to change.

  • Workloads needing adaptive file sizes and logical organization over physical folder-based partitioning.

Performance Observations

  • Liquid Clustering can provide 30–60% query speed improvement for broad, variable queries in real-world tests, and up to 40% improvement in specific analytics use cases.

  • For highly selective, single-partition queries (e.g. one day’s data), partitioning performs best, as Spark prunes files very efficiently using physical directories.

  • Too many clustering keys (3–4+) in Liquid Clustering can degrade performance for single-column filters, especially with smaller tables, because file sizes get less optimal and file skipping becomes less effective.

  • On very large tables, the negative effect of many clustering keys becomes less significant, and Liquid Clustering’s adaptive layout streamlines broader queries.

  • File count matters: Partitioning can lead to many small files if overused or if queries cross many partitions, while Liquid Clustering tends to yield fewer, larger, well-sized files, decreasing overhead for broad scans.

Best Practices and Gotchas

  • Choose clustering columns based on query frequency and selectivity. Avoid clustering on columns with high cardinality unless queries frequently filter on them.

  • For mixed workloads, start with two clustering columns and adjust based on observed query performance, especially as table size grows.

  • Test and monitor: Large batch inserts and optimization jobs can temporarily double table size due to Delta Lake versioning; running VACUUM after OPTIMIZE is necessary to reduce retained file counts.

  • For production, change clustering columns only after understanding how query patterns evolve. Liquid Clustering allows easier evolution than partitioning, but abrupt changes can require major table rewrites for optimal performance.

  • Avoid over-partitioning or clustering on seldom-used columns. Both approaches suffer from too many keys or directories, leading to excessive metadata and small files.

  • Vacuum old files regularly when using Liquid Clustering, especially after significant changes or optimizations, to maintain optimal storage size.
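The temporary size growth called out above comes from Delta retaining old file versions for time travel; a sketch of reclaiming space after a large optimization job (hypothetical table name; 168 hours is the default retention, and shortening it affects time travel):

```sql
-- After a large OPTIMIZE, the pre-optimize files are still retained for time travel
OPTIMIZE sales.events;

-- Reclaim storage once the retention window allows
VACUUM sales.events RETAIN 168 HOURS;

-- Verify the reduction in retained files
DESCRIBE DETAIL sales.events;  -- check numFiles and sizeInBytes
```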

Real-World Experiences

  • Many have reported substantial improvements for complex analytics and wide-table queries after switching to Liquid Clustering, especially on tables exceeding several terabytes or with shifting business needs.

  • Some encountered challenges with table size ballooning after optimization jobs but resolved them by splitting ingest workloads and running timely vacuum operations.

  • Overall, Liquid Clustering tends to win out for broad queries and evolving pipelines, while partitioning and Z-ordering remain best for predictable, high-selectivity workloads on relatively static schemas.

In summary, use partitioning and Z-ordering for small, predictable workloads; shift to Liquid Clustering for large, diverse, and evolving workloads, but tune clustering keys and regularly maintain your tables for optimal results.