Databricks Community

Rupa0503 · yesterday

I want to understand the difference between Liquid Clustering and Z-ordering, and how each one works.

ShamenParis · yesterday

Liquid Clustering is basically the modern replacement for Z-ordering. Both are great for data skipping (faster reads), but Liquid fixes a lot of Z-order's headaches.

How They Work (and why Liquid wins)

Z-Ordering: It's rigid. The layout is only maintained when you run OPTIMIZE ... ZORDER BY, and after new data arrives that run often has to rewrite many of your existing files to keep things sorted. It's slow and computationally expensive.
Liquid Clustering: It's flexible and incremental. When you optimize, Databricks only processes what it needs to. It's way faster to update, handles skewed data better, and lets you change clustering keys without rewriting the whole table.

How to Use It / Migrate

Moving from Z-order to Liquid is super easy using ALTER TABLE:

Use Standard Liquid: ALTER TABLE table CLUSTER BY (col1, col2) (Just remember to run OPTIMIZE afterward!)
Use Auto Liquid: ALTER TABLE table CLUSTER BY AUTO (Note: requires Predictive Optimization enabled)
Turn it off: ALTER TABLE table CLUSTER BY NONE
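
For anyone scripting the migration, here is a minimal Python sketch that only assembles the statements listed above as strings. The function name liquid_statements and the table name main.sales.events are made up for illustration. In a Databricks notebook you would pass each string to spark.sql(...); the sketch does not do that, so it runs anywhere.

```python
def liquid_statements(table, keys=None, auto=False):
    """Build the SQL for one of the three modes listed above.

    `table` and `keys` are placeholders supplied by the caller. Nothing here
    talks to Databricks; the function only returns strings.
    """
    if auto:
        # Auto Liquid: Databricks picks the keys (needs Predictive Optimization).
        return [f"ALTER TABLE {table} CLUSTER BY AUTO"]
    if not keys:
        # No keys and no auto: turn clustering off.
        return [f"ALTER TABLE {table} CLUSTER BY NONE"]
    # Standard Liquid: set the keys, then run OPTIMIZE as noted above.
    cols = ", ".join(keys)
    return [f"ALTER TABLE {table} CLUSTER BY ({cols})", f"OPTIMIZE {table}"]


# Hypothetical table and columns, for illustration only.
for stmt in liquid_statements("main.sales.events", keys=["col1", "col2"]):
    print(stmt)
```

Keeping the statements as plain strings makes it easy to review them before anything touches a real table.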

My Personal Benchmarks & Recommendation

I tested Z-order, Standard Liquid, and Auto Liquid with the exact same data and tables. Here is the verdict:

Reads: All three perform about the same.
Writes/Optimization: Auto Liquid is definitely the fastest.
Cost (My Pick): I personally stick to Standard Liquid Clustering to save money. Auto Liquid uses Predictive Optimization, which runs on Serverless compute and adds extra costs. Standard Liquid gives you all the incremental speed benefits over Z-order, but leaves you in control of your compute bill!

balajij8 · yesterday

@Rupa0503

Both are optimization approaches for Delta Lake query performance but differ in flexibility and maintenance.

Z-Ordering is an optimization approach that co-locates related data across multiple columns within the same files, based on the columns you specify.

You manually specify columns via OPTIMIZE table ZORDER BY (col1, col2) and run OPTIMIZE periodically to maintain the layout as data grows. It's ideal for stable, read-heavy legacy workloads with predictable filter patterns.
During OPTIMIZE, files are rewritten to interleave values across the specified dimensions, which improves file skipping for multi-column filters.
Z-ordering remains a reasonable choice for legacy tables with stable filter columns, and it works best when those columns are high-cardinality.
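
To make the "interleave values across dimensions" step concrete, here is a small Python sketch of the textbook Z-order (Morton) curve for two integer columns. It only illustrates the general idea. That OPTIMIZE ... ZORDER BY works this way internally is an assumption, and the function name z_value is made up for this example.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits land on even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits land on odd positions
    return z


# A 4x4 grid of (x, y) pairs, sorted by the interleaved key.
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: z_value(*p))

print(ordered[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

The first four rows in Z-order form a 2x2 block, so rows that are close in both columns end up next to each other, and therefore in the same files once the sorted rows are written out.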

Liquid Clustering is the modern approach and the one I recommend for new tables. It uses a tree-based algorithm to incrementally organize data by clustering keys without full rewrites.

Dynamic: Change clustering keys anytime via CLUSTER BY (cols) without rewriting existing data.
Automatic & Incremental: Supports CLUSTER BY AUTO, which lets Databricks select optimal keys based on query history.
Handles complexity: Better for high-cardinality columns, skewed data, or evolving query patterns.
Use Liquid Clustering for new tables with high-cardinality filters, concurrent writes, or query patterns that evolve.

More details here

Ashwin_DSA · yesterday

Hi @Rupa0503,

In simple terms, both Liquid Clustering and Z-ordering are ways to improve data layout so Databricks can skip more irrelevant files during reads, but they are not the same thing.

If I had to summarise it: Z-ordering is the older, more manual way to colocate related values in the same set of files, while Liquid Clustering is the newer, more flexible approach that Databricks now recommends for new tables.

A practical difference is that Liquid Clustering is designed to replace both partitioning and ZORDER for table layout, and it is not compatible with Z-ordering on the same table.

Here’s the intuitive version of how they work:

Z-ordering reorganises data so rows with similar values across the chosen columns end up physically closer together in the same files. Databricks then uses file-level statistics, such as min/max values, to skip files unlikely to match the query.
Liquid Clustering also improves file skipping, but rather than locking you into a rigid layout, it organizes data around clustering keys and lets that layout evolve more easily over time.
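
The file-skipping idea can be sketched in a few lines of Python. This is a toy model under stated assumptions: each "file" is just a list of integers, the only statistic kept is min/max, and the helper names chunk and files_scanned are hypothetical, not Databricks APIs.

```python
def chunk(rows, size):
    """Split rows into consecutive 'files' of `size` rows each."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def files_scanned(files, lo, hi):
    """Count files whose [min, max] range overlaps the filter lo <= v <= hi."""
    return sum(1 for f in files if min(f) <= hi and max(f) >= lo)


values = list(range(1000))

# Unclustered: rows arrive interleaved, so every file spans almost the full range.
unclustered = [values[i::10] for i in range(10)]
# Clustered: similar values are stored together, 100 per file.
clustered = chunk(values, 100)

print(files_scanned(unclustered, 250, 260), "of", len(unclustered))  # 10 of 10
print(files_scanned(clustered, 250, 260), "of", len(clustered))      # 1 of 10
```

The data and the filter are identical in both cases. Only the physical layout changes, and that alone decides how many files the min/max statistics can rule out.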

One nice way to think about it is this: imagine a warehouse.

Z-ordering is like manually rearranging the shelves so items that are often requested together are stored near each other. That helps workers walk less, but if demand patterns change, you may need to reorganise again.
Liquid Clustering is more like having a smarter warehouse layout system where you define the important access dimensions, and the system keeps organising incoming inventory around those keys with much less rigidity. If your access pattern changes, you can change the clustering keys without the same kind of full redesign you would normally worry about with traditional partitioning.

A few concrete differences that usually help:

With Z-ordering, you typically run OPTIMIZE ... ZORDER BY (...) to rewrite data for the columns you care about. It works, but it is more of an explicit maintenance choice. With Liquid Clustering, you define clustering using CLUSTER BY, and Databricks can incrementally cluster data with OPTIMIZE. You can also redefine clustering keys later without rewriting existing data.
Databricks recommends Liquid Clustering for all new tables, whereas Z-ordering is still documented but no longer the recommended default for new layouts.
Z-ordering can work well when queries frequently filter on high-cardinality columns, but its effectiveness drops as you add more columns to the ZORDER list. With Liquid Clustering, the docs explicitly say key order does not matter, which removes one more tuning decision from the user.
If query patterns change over time, Liquid Clustering is better suited to that because the clustering definition can evolve. With automatic liquid clustering on supported tables, Databricks can even analyze historical query workload and choose or adapt keys for you.
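
The remark above that Z-order effectiveness drops as you add columns can be seen in a toy model. The Python sketch below assumes a 16x16 grid of (x, y) rows, 16 rows per file, min/max statistics only, and a textbook bit-interleaving key. All names (z_value, chunk, files_scanned) are made up for illustration and say nothing definitive about Databricks internals.

```python
def z_value(x, y, bits=4):
    """Textbook Z-order key: interleave the bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def chunk(rows, size):
    """Split rows into consecutive 'files' of `size` rows each."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def files_scanned(files, col, value):
    """Count files whose min/max range for column `col` contains `value`."""
    return sum(
        1 for f in files
        if min(r[col] for r in f) <= value <= max(r[col] for r in f)
    )


rows = [(x, y) for x in range(16) for y in range(16)]
by_x = chunk(sorted(rows, key=lambda r: r[0]), 16)          # layout tuned for x only
by_z = chunk(sorted(rows, key=lambda r: z_value(*r)), 16)   # layout shared by x and y

for name, files in (("sort by x only", by_x), ("z-order on (x, y)", by_z)):
    fx = files_scanned(files, 0, 5)
    fy = files_scanned(files, 1, 5)
    print(f"{name}: x=5 reads {fx} files, y=5 reads {fy} files")
# sort by x only: x=5 reads 1 files, y=5 reads 16 files
# z-order on (x, y): x=5 reads 4 files, y=5 reads 4 files
```

Adding y to the key makes filters on y skippable, at the price of weaker skipping on x. Each extra column spreads the same locality budget a little thinner.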

So if you want a simple rule of thumb:

a) Use Liquid Clustering for new tables.
b) Think of Z-ordering mainly as the older layout optimisation mechanism you may still see on existing tables.
c) Don’t use both together on the same table.

To @ShamenParis's point about costs: Standard Liquid Clustering can be a good fit if you want the benefits of Liquid Clustering but prefer to control when maintenance runs and what compute it uses. Automatic Liquid Clustering depends on Predictive Optimization, which runs maintenance on serverless jobs compute and is billed separately. That said, Auto Liquid is designed to be cost-aware and can reduce overall TCO when the performance gains justify the maintenance cost. I wouldn't classify it as an expensive mode.

If useful, the official docs are the best references: the Liquid Clustering docs and the Data skipping and Z-ordering docs.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***