cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Machine Learning
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

What is the disadvantage of using multiple Z-Order columns?

qwerty1
Contributor

The documentation states

You can specify multiple columns for  ZORDER BY as a comma-separated list. However, the effectiveness of the locality drops with each extra column

What does it mean for "effectiveness of the locality to drop" with each extra column? For eg: I would like to query my delta table with 20 columns mostly using predicates on source and completedAt . Here source can have ~500 distinct values and completedAt is a timestamp where I would like to query records after/before a particular timestamp.

In the above scenario would the effectiveness of the Z order query drop if I just used source in predicate?

1 ACCEPTED SOLUTION

Accepted Solutions

Anonymous
Not applicable

@Ashwin Bhaskarโ€‹ :

Z-ordering is a technique to improve the performance of queries that involve filtering and grouping on specific columns in a large distributed database. When a table is z-ordered on a certain column or set of columns, the data is sorted based on the values of those columns, and stored in a way that maximizes the locality of the data on the storage system.

When multiple columns are used for Z-ordering, the effectiveness of the locality drops because the data is sorted based on multiple columns, and it is more difficult to ensure that the data with the same value combinations of those columns is stored together. This means that queries that use only one of the columns specified in the Z-ordering will not be able to take full advantage of the improved locality provided by the Z-ordering.

In your scenario, if you only use the "source" column in your query, then using it as the sole Z-order column will likely provide better performance than using multiple columns. However, if you frequently query on combinations of the "source" and "completedAt" columns, then using both columns in the Z-ordering may still provide a performance benefit. Ultimately, the decision on how to Z-order the columns should be based on the specific usage patterns of the data and the queries being performed.

View solution in original post

1 REPLY 1

Anonymous
Not applicable

@Ashwin Bhaskarโ€‹ :

Z-ordering is a technique to improve the performance of queries that involve filtering and grouping on specific columns in a large distributed database. When a table is z-ordered on a certain column or set of columns, the data is sorted based on the values of those columns, and stored in a way that maximizes the locality of the data on the storage system.

When multiple columns are used for Z-ordering, the effectiveness of the locality drops because the data is sorted based on multiple columns, and it is more difficult to ensure that the data with the same value combinations of those columns is stored together. This means that queries that use only one of the columns specified in the Z-ordering will not be able to take full advantage of the improved locality provided by the Z-ordering.

In your scenario, if you only use the "source" column in your query, then using it as the sole Z-order column will likely provide better performance than using multiple columns. However, if you frequently query on combinations of the "source" and "completedAt" columns, then using both columns in the Z-ordering may still provide a performance benefit. Ultimately, the decision on how to Z-order the columns should be based on the specific usage patterns of the data and the queries being performed.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.