
What is Z-ordering in Delta and what are some best practices on using it?

Anand_Ladda
Honored Contributor II

Accepted Solutions

Anand_Ladda
Honored Contributor II

Z-Ordering is a technique for colocating related information in the same set of files. The data-skipping algorithms in Delta Lake on Databricks use this co-locality automatically to dramatically reduce the amount of data that needs to be read. Z-ordering is applied with the OPTIMIZE command and its ZORDER BY clause.
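To make the data-skipping idea concrete, here is a minimal Python sketch. It is not Delta Lake's implementation: the FileStats class, the file names and the min/max values are all made up for illustration. It only shows why colocated values let a reader skip files based on per-file min/max statistics.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file statistics for one column (illustration only)."""
    path: str
    min_value: int
    max_value: int


def files_to_read(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the predicate range [lo, hi]."""
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


# Values scattered across files: every file spans almost the whole range.
scattered = [
    FileStats("part-0", 1, 98),
    FileStats("part-1", 3, 99),
    FileStats("part-2", 0, 97),
]

# Values colocated: each file covers a narrow, distinct range.
colocated = [
    FileStats("part-0", 0, 33),
    FileStats("part-1", 34, 66),
    FileStats("part-2", 67, 99),
]

# Predicate: column BETWEEN 40 AND 50
print(files_to_read(scattered, 40, 50))
print(files_to_read(colocated, 40, 50))
```

With the scattered layout no file can be ruled out, while the colocated layout leaves a single candidate file.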

If you expect a column to be commonly used in query predicates, and that column has high cardinality (that is, a large number of distinct values), which would make it a poor choice for partitioning the table, use ZORDER BY on it instead. For example, for a table of companies and dates that collects data over several years, you might partition by company and Z-order by date.
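As a rough sketch of the company/date example, the Python helper below only assembles the SQL text. The table name company_events, its columns and the helper optimize_zorder_sql are hypothetical names chosen for illustration; running the statements requires a Spark session with Delta Lake, which is left as a comment.

```python
# Hypothetical table: partitioned by the low-cardinality column (company).
CREATE_TABLE_SQL = (
    "CREATE TABLE company_events "
    "(company STRING, event_date DATE, amount DOUBLE) "
    "USING DELTA PARTITIONED BY (company)"
)


def optimize_zorder_sql(table, zorder_cols, where=None):
    """Build an OPTIMIZE ... ZORDER BY statement as a string.

    `where` is an optional predicate; this sketch assumes it filters
    on partition columns only.
    """
    if not zorder_cols:
        raise ValueError("at least one Z-order column is required")
    stmt = f"OPTIMIZE {table}"
    if where:
        stmt += f" WHERE {where}"
    stmt += f" ZORDER BY ({', '.join(zorder_cols)})"
    return stmt


# Z-order by the high-cardinality column (event_date).
stmt = optimize_zorder_sql("company_events", ["event_date"])
print(stmt)

# On a cluster with Delta Lake available (assumption), you would run:
# spark.sql(CREATE_TABLE_SQL)
# spark.sql(stmt)
```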

You can specify multiple columns for ZORDER BY as a comma-separated list. However, the effectiveness of the locality drops with each additional column.
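The drop in locality can be seen in a toy simulation. The sketch below is a conceptual model, not Delta Lake's actual algorithm: it assumes Z-ordering behaves like a plain Morton (bit-interleaving) sort, uses a made-up table of three columns with 16 values each, and splits the sorted rows into fixed-size "files". It then counts how many files a filter on the first column would still have to read when Z-ordering by one, two or three columns.

```python
from itertools import product

BITS = 4             # each column takes values 0..15
ROWS_PER_FILE = 64   # 16**3 = 4096 rows -> 64 files


def interleave(values, bits=BITS):
    """Morton code: round-robin interleave the bits of each value."""
    code = 0
    for bit in range(bits):
        for i, v in enumerate(values):
            code |= ((v >> bit) & 1) << (bit * len(values) + i)
    return code


def files_touched(num_zorder_cols, wanted=5):
    """Count files whose min/max range on column 0 contains `wanted`."""
    rows = list(product(range(1 << BITS), repeat=3))
    # Z-order by the first k columns; the remaining columns only break ties.
    rows.sort(key=lambda r: (interleave(r[:num_zorder_cols]), r[num_zorder_cols:]))
    files = [rows[i:i + ROWS_PER_FILE] for i in range(0, len(rows), ROWS_PER_FILE)]
    touched = 0
    for f in files:
        col0 = [r[0] for r in f]
        if min(col0) <= wanted <= max(col0):
            touched += 1
    return touched


# Files read for "col0 = 5" when Z-ordering by 1, 2 and 3 columns.
results = {k: files_touched(k) for k in (1, 2, 3)}
print(results)
```

In this toy setup the same filter has to read 4, then 8, then 16 of the 64 files as columns are added, because each file's range on any single column gets wider.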

Note that data skipping only takes effect if statistics are collected on the columns you Z-order by. By default, statistics are collected on the first 32 columns of the table, so either reorder the table so that the Z-order column(s) are among the first 32 columns, or raise the delta.dataSkippingNumIndexedCols table property.
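A quick way to check this before running OPTIMIZE is sketched below. The helpers columns_missing_stats and set_indexed_cols_sql and the table name wide_table are hypothetical, and the check assumes a flat schema (it does not account for how nested fields are counted). The ALTER TABLE statement is only built as a string here.

```python
def columns_missing_stats(schema_columns, zorder_cols, num_indexed_cols=32):
    """Return Z-order columns that fall outside the first N columns of the schema."""
    indexed = set(schema_columns[:num_indexed_cols])
    return [c for c in zorder_cols if c not in indexed]


def set_indexed_cols_sql(table, num_indexed_cols):
    """Build the statement that changes delta.dataSkippingNumIndexedCols."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('delta.dataSkippingNumIndexedCols' = '{num_indexed_cols}')"
    )


# Hypothetical wide table with 40 columns named c0..c39.
schema = [f"c{i}" for i in range(40)]

missing = columns_missing_stats(schema, ["c5", "c35"])
print(missing)

if missing:
    # Either move these columns forward in the schema, or widen the indexed range.
    print(set_indexed_cols_sql("wide_table", 40))
```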

And if you learn best through visuals, this is a great explainer video on Z-ordering on Delta tables.


2 REPLIES

User16826994223
Honored Contributor III

Nicely written.

