cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Warehousing & Analytics
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

what are the benefits to do use Z-Ordering

jose_gonzalez
Moderator
Moderator

I would like to know why and when I will do a Z-Ordering on my Delta table?

1 ACCEPTED SOLUTION

Accepted Solutions

jose_gonzalez
Moderator
Moderator

Z-ordering will help you to improve query speed. You can run your Z-ordering when you execute your Optimize jobs. For more details please check the docs https://docs.databricks.com/delta/optimizations/file-mgmt.html#z-ordering-multi-dimensional-clusteri...

View solution in original post

2 REPLIES 2

jose_gonzalez
Moderator
Moderator

Z-ordering will help you to improve query speed. You can run your Z-ordering when you execute your Optimize jobs. For more details please check the docs https://docs.databricks.com/delta/optimizations/file-mgmt.html#z-ordering-multi-dimensional-clusteri...

User16857281974
Contributor

Apache Spark does not have the features of a relational database wherein you can do a search on a primary key for example. It is forced to read in 100% of the data (generally speaking), which hurts performance at Gigabyte+ scales and test every single value in your query.

The closest Spark can get to a traditional index is disk partitioning. Disk partitioning allows you to write part-files to a directory and it is a form of index wherein you get the benefit of reading in all the data from all part files but only from the one directly. An example of this would be partitioning 20 years of data by year - a query for all data in 1999 would read in only those part files in the directory for 1999. This is good for low-cardinality searches (e.g. 1 year out of 20, once city out 10,000).

Z-Ordering brings to Apache Spark the ability to execute high-cardinality searches, or that needle-in-the-haystack search, that relational databases are so good at. In short, it can find that one, unique transaction record out of a Petabyte of transitions at speeds that are just unparalleled compared to the stock solution of scanning every single record of that Petabyte-sized dataset. It does this by leveraging Delta's log file and tracking in which part-file your record (or rather index's record) is located in. This means it reads in only those part-files that the index "might" be as opposed to 100% of the data, or if you are partitioning, an entire disk-partition of data.

You can see this all in action here: https://www.databricks.training/spark-ui-simulator/experiment-1337/v002-S/index.html. If you look at Cmd 3, it established a baseline read on a terabyte datasets for just one record at 17 minutes using 128 cores. In Cmd 4, you can see that same query execute in just under 3 minutes against the exact same datasets, only Z-Orderd.

To be very specific on the "when" part of your question, you want to use Z-Ordering when your query represents that needle-in-the-haystack type of query, or to relate it back to traditional databases, for your primary keys and unique indexes.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.