cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

What is the disadvantage of using multiple Z-Order columns?

qwerty1
Contributor

The documentation states

You can specify multiple columns for  ZORDER BY as a comma-separated list. However, the effectiveness of the locality drops with each extra column

What does it mean for "effectiveness of the locality to drop" with each extra column? For eg: I would like to query my delta table with 20 columns mostly using predicates on source and completedAt . Here source can have ~500 distinct values and completedAt is a timestamp where I would like to query records after/before a particular timestamp.

In the above scenario would the effectiveness of the Z order query drop if I just used source in predicate?

1 ACCEPTED SOLUTION

Accepted Solutions

Anonymous
Not applicable

@Ashwin Bhaskarโ€‹ :

Z-ordering is a technique to improve the performance of queries that involve filtering and grouping on specific columns in a large distributed database. When a table is z-ordered on a certain column or set of columns, the data is sorted based on the values of those columns, and stored in a way that maximizes the locality of the data on the storage system.

When multiple columns are used for Z-ordering, the effectiveness of the locality drops because the data is sorted based on multiple columns, and it is more difficult to ensure that the data with the same value combinations of those columns is stored together. This means that queries that use only one of the columns specified in the Z-ordering will not be able to take full advantage of the improved locality provided by the Z-ordering.

In your scenario, if you only use the "source" column in your query, then using it as the sole Z-order column will likely provide better performance than using multiple columns. However, if you frequently query on combinations of the "source" and "completedAt" columns, then using both columns in the Z-ordering may still provide a performance benefit. Ultimately, the decision on how to Z-order the columns should be based on the specific usage patterns of the data and the queries being performed.

View solution in original post

1 REPLY 1

Anonymous
Not applicable

@Ashwin Bhaskarโ€‹ :

Z-ordering is a technique to improve the performance of queries that involve filtering and grouping on specific columns in a large distributed database. When a table is z-ordered on a certain column or set of columns, the data is sorted based on the values of those columns, and stored in a way that maximizes the locality of the data on the storage system.

When multiple columns are used for Z-ordering, the effectiveness of the locality drops because the data is sorted based on multiple columns, and it is more difficult to ensure that the data with the same value combinations of those columns is stored together. This means that queries that use only one of the columns specified in the Z-ordering will not be able to take full advantage of the improved locality provided by the Z-ordering.

In your scenario, if you only use the "source" column in your query, then using it as the sole Z-order column will likely provide better performance than using multiple columns. However, if you frequently query on combinations of the "source" and "completedAt" columns, then using both columns in the Z-ordering may still provide a performance benefit. Ultimately, the decision on how to Z-order the columns should be based on the specific usage patterns of the data and the queries being performed.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group