Data Engineering

What is Z-ordering in Delta and what are some best practices on using it?

aladda
Honored Contributor II
Accepted Solutions

aladda
Honored Contributor II

Z-ordering is a technique for colocating related information in the same set of files. Delta Lake on Databricks automatically uses this co-locality in its data-skipping algorithms to dramatically reduce the amount of data that needs to be read. Syntax for Z-ordering can be found here.
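
To make the colocation idea concrete, here is a toy model in plain Python. It is only a sketch, not Delta's actual implementation: it assumes two integer columns x and y, treats a "file" as a fixed-size run of sorted rows, and keeps per-file min/max statistics to decide which files a point lookup has to read. The function names are made up for illustration.

```python
def interleave_bits(x: int, y: int, bits: int = 4) -> int:
    """Morton (Z-order) code: interleave the bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return z


def write_files(rows, key, rows_per_file):
    """Sort rows by key, cut them into files, keep min/max stats per column."""
    ordered = sorted(rows, key=key)
    files = []
    for start in range(0, len(ordered), rows_per_file):
        chunk = ordered[start:start + rows_per_file]
        files.append({
            "x": (min(r[0] for r in chunk), max(r[0] for r in chunk)),
            "y": (min(r[1] for r in chunk), max(r[1] for r in chunk)),
        })
    return files


def files_to_read(files, column, value):
    """Data skipping: a file is read only if value falls inside its min/max range."""
    return sum(1 for f in files if f[column][0] <= value <= f[column][1])


rows = [(x, y) for x in range(16) for y in range(16)]  # 256 rows, 16 files of 16 rows
linear = write_files(rows, key=lambda r: (r[0], r[1]), rows_per_file=16)
zorder = write_files(rows, key=lambda r: interleave_bits(r[0], r[1]), rows_per_file=16)

for name, files in (("sorted by x, y", linear), ("z-ordered on x, y", zorder)):
    print(name,
          "| x = 5 reads", files_to_read(files, "x", 5), "files",
          "| y = 5 reads", files_to_read(files, "y", 5), "files")
```

In this toy setup a plain sort on (x, y) is ideal for filters on x but forces a full scan for filters on y, while the Z-ordered layout lets both filters skip most files.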

If you expect a column to be commonly used in query predicates, and that column has high cardinality (that is, a large number of distinct values), which would make it a poor choice for partitioning the table, use ZORDER BY on it instead. For example, in a table of companies and dates that collects data over several years, you might partition by company and Z-order by date.
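
The company/date example could look roughly like this. It is a sketch that only builds the SQL text; the table name events and the columns company, event_date and amount are made up, and on a cluster you would pass the strings to spark.sql(...). The guard against Z-ordering on a partition column reflects my understanding that Delta rejects that combination.

```python
def create_table_sql(table, columns, partition_by):
    """Build a CREATE TABLE statement for a partitioned Delta table."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(partition_by)
    return f"CREATE TABLE {table} ({cols}) USING DELTA PARTITIONED BY ({parts})"


def optimize_sql(table, zorder_by, partition_cols=()):
    """Build an OPTIMIZE ... ZORDER BY statement."""
    overlap = set(zorder_by) & set(partition_cols)
    if overlap:
        # Assumption: Z-order columns must be data columns, not partition columns.
        raise ValueError(f"cannot Z-order by partition column(s): {sorted(overlap)}")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(zorder_by)})"


# Hypothetical table: low-cardinality company as the partition column,
# high-cardinality event_date as the Z-order column.
columns = [("company", "STRING"), ("event_date", "DATE"), ("amount", "DOUBLE")]
create_stmt = create_table_sql("events", columns, partition_by=["company"])
optimize_stmt = optimize_sql("events", ["event_date"], partition_cols=["company"])

print(create_stmt)
print(optimize_stmt)
# On Databricks: spark.sql(create_stmt); spark.sql(optimize_stmt)
```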

You can specify multiple columns for ZORDER BY as a comma-separated list. However, the effectiveness of the locality drops with each additional column.
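
A small simulation shows why locality drops as columns are added. Again this is a toy model rather than what Delta does internally: it assumes three integer columns with values 0 to 15, 16 "files" of 256 rows each, and a filter on the first column only. The helper names are hypothetical.

```python
from itertools import product

BITS = 4  # each column holds values 0..15


def z_value(row, zorder_cols):
    """Interleave the bits of the chosen columns into one Morton code."""
    k = len(zorder_cols)
    z = 0
    for i in range(BITS):
        for d, col in enumerate(zorder_cols):
            z |= ((row[col] >> i) & 1) << (i * k + d)
    return z


def files_read(rows, zorder_cols, filter_col, value, rows_per_file):
    """Count files whose min/max range on filter_col could contain value."""
    ordered = sorted(rows, key=lambda r: z_value(r, zorder_cols))
    hits = 0
    for start in range(0, len(ordered), rows_per_file):
        chunk = [r[filter_col] for r in ordered[start:start + rows_per_file]]
        if min(chunk) <= value <= max(chunk):
            hits += 1
    return hits


rows = list(product(range(2 ** BITS), repeat=3))  # 4096 rows, columns 0, 1, 2
results = {}
for cols in ([0], [0, 1], [0, 1, 2]):
    results[len(cols)] = files_read(rows, cols, filter_col=0, value=5, rows_per_file=256)
    print(f"Z-order on {len(cols)} column(s): {results[len(cols)]} of 16 files read")
```

Each extra column takes a share of the high-order bits of the sort key, so every file ends up covering a wider range of the column you filter on and fewer files can be skipped.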

Note that statistics must be collected on the columns you Z-order by; otherwise data skipping won't take effect. By default, statistics are collected only on the first 32 columns of the table, so either reorder the table so that the Z-order column(s) are among the first 32 columns, or change the dataSkippingNumIndexedCols table property.
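
A quick way to sanity-check this before running OPTIMIZE is sketched below. It assumes a flat schema (nested fields make the column count less obvious), a made-up table wide_events with columns c0 to c39, and that the full property name is delta.dataSkippingNumIndexedCols. The helpers only inspect a list of column names and build the ALTER TABLE text; they are not part of any Delta API.

```python
DEFAULT_INDEXED_COLS = 32  # default number of leading columns with statistics


def unindexed_zorder_cols(schema_cols, zorder_cols, num_indexed=DEFAULT_INDEXED_COLS):
    """Return the Z-order columns that fall outside the first num_indexed columns."""
    indexed = set(schema_cols[:num_indexed])
    return [c for c in zorder_cols if c not in indexed]


def set_indexed_cols_sql(table, n):
    """Build the statement that raises the number of columns with statistics."""
    return (f"ALTER TABLE {table} "
            f"SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '{n}')")


schema_cols = [f"c{i}" for i in range(40)]   # hypothetical 40-column table
zorder_cols = ["c3", "c35"]

missing = unindexed_zorder_cols(schema_cols, zorder_cols)
print("Z-order columns without statistics:", missing)

if missing:
    # Smallest setting that covers every Z-order column in this schema.
    needed = max(schema_cols.index(c) for c in zorder_cols) + 1
    alter_stmt = set_indexed_cols_sql("wide_events", needed)
    print(alter_stmt)  # on Databricks: spark.sql(alter_stmt)
```

Collecting statistics on more columns adds some overhead on writes, so moving the Z-order columns toward the front of the schema is the other option mentioned above.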

And if you learn best through visuals, this is a great explainer video on Z-ordering on Delta tables.

2 REPLIES 2

User16826994223
Honored Contributor III

Nicely written.
