Databricks Community

jose_gonzalez · ‎06-10-2021

I would like to know why and when I will do a Z-Ordering on my Delta table?

jose_gonzalez · ‎06-11-2021

Z-ordering will help you to improve query speed. You can run your Z-ordering when you execute your Optimize jobs. For more details please check the docs https://docs.databricks.com/delta/optimizations/file-mgmt.html#z-ordering-multi-dimensional-clusteri...

View solution in original post

jose_gonzalez · ‎06-11-2021

Z-ordering will help you to improve query speed. You can run your Z-ordering when you execute your Optimize jobs. For more details please check the docs https://docs.databricks.com/delta/optimizations/file-mgmt.html#z-ordering-multi-dimensional-clusteri...

User16857281974 · ‎07-30-2021

Apache Spark does not have the features of a relational database wherein you can do a search on a primary key for example. It is forced to read in 100% of the data (generally speaking), which hurts performance at Gigabyte+ scales and test every single value in your query.

The closest Spark can get to a traditional index is disk partitioning. Disk partitioning allows you to write part-files to a directory and it is a form of index wherein you get the benefit of reading in all the data from all part files but only from the one directly. An example of this would be partitioning 20 years of data by year - a query for all data in 1999 would read in only those part files in the directory for 1999. This is good for low-cardinality searches (e.g. 1 year out of 20, once city out 10,000).

Z-Ordering brings to Apache Spark the ability to execute high-cardinality searches, or that needle-in-the-haystack search, that relational databases are so good at. In short, it can find that one, unique transaction record out of a Petabyte of transitions at speeds that are just unparalleled compared to the stock solution of scanning every single record of that Petabyte-sized dataset. It does this by leveraging Delta's log file and tracking in which part-file your record (or rather index's record) is located in. This means it reads in only those part-files that the index "might" be as opposed to 100% of the data, or if you are partitioning, an entire disk-partition of data.

You can see this all in action here: https://www.databricks.training/spark-ui-simulator/experiment-1337/v002-S/index.html. If you look at Cmd 3, it established a baseline read on a terabyte datasets for just one record at 17 minutes using 128 cores. In Cmd 4, you can see that same query execute in just under 3 minutes against the exact same datasets, only Z-Orderd.

To be very specific on the "when" part of your question, you want to use Z-Ordering when your query represents that needle-in-the-haystack type of query, or to relate it back to traditional databases, for your primary keys and unique indexes.

Databricks Community

what are the benefits to do use Z-Ordering

Join Us as a Local Community Builder!

🌟 Community Pulse: Your Weekly Roundup! October 31 – November 06, 2025

Free Edition Hackathon

🚀 Announcing the Databricks Data Intelligence Platform Cheat Sheet

Zerobus Ingest in Action: How to Stream Event Data Directly into Your Lakehouse

Find Sensitive Data at Scale with Data Classification in Unity Catalog