cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

How is Z-ORDER different from bucketing in Hive?

Anonymous
Not applicable
 
1 ACCEPTED SOLUTION

Accepted Solutions

User16783855117
Contributor II

Bucketing is essentially running a single column through a hashing function to place them into a 'bucket'. So for example, I have a customer ID, and I have 2 buckets configured for the Customer ID column:

Customer 1 goes to Bucket 1

Customer 2 goes to Bucket 2

Customer 3 goes to Bucket 1

Customer 4 goes to Bucket 2

This bucketing strategy is very helpful when you have big table to big table joins. Let's expand the example with Customers in one table, and Sales organized in another table. Both tables should be bucketed based on the join key, Customer Id, into an equal amount of buckets. Now when I try to join Customer 1 with all of their transactions, I know that Customer 1 will be in Bucket 1, for both the Customer and Sales tables. This means I will only need to join values from Bucket 1 for both of these tables in order to join the data, and I can skip trying to join the values from Bucket 1 to Bucket 2 for both tables.

Z ORDER-ing is another technique for 'Data Clustering' (where similar data elements are co-located together) but it provides additional functions on top of bucketing. Z ORDER indexes can be created using multiple columns, but the index efficiency drops off as you add more columns. Z ORDER-ing does not require you to specify the number of buckets that I want to bucket a column into, it just works automatically. The same benefits listed above for bucket to bucket joins apply for Z ORDER-ed indexes as well.

This Z ORDER index is also extracted and tracked inside of the Delta Log, and is used during query planning. So when querying the dataset based on a Z ORDER-ed column, the Z ORDER index can be used to efficiently seek directly to the right data elements, reducing query time.

A challenge of bucketing is that it is prone to creating small files. A 'bucket' in the Hive implementation is a subdirectory, and you will need a subdirectory for each bucket value inside of each partition . If I had 32 buckets, and a table partitioned by month, I could easily have 12 * 32 = 384 total directories, each with small files inside of them. Z ORDER-ing instead combines similar data elements into files, and attempts to right-size the files at the same time thanks to the OPTIMIZE command in Delta Lake. So instead of having tons of very small files broken up into 384 bucket folders, I have fewer files with more records inside of each file in the 12 folders, with the benefits of the Z ORDER index. These files are still created in their respective partitions though, so if you have a bad partitioning strategy you will still have small files.

There are further enhancements that Z ORDER-ing bring on top of some of the benefits listed above, but hopefully this is a good primer for comparing Z ORDERING and bucketing in Hive. 🙂

View solution in original post

2 REPLIES 2

User16783855117
Contributor II

Bucketing is essentially running a single column through a hashing function to place them into a 'bucket'. So for example, I have a customer ID, and I have 2 buckets configured for the Customer ID column:

Customer 1 goes to Bucket 1

Customer 2 goes to Bucket 2

Customer 3 goes to Bucket 1

Customer 4 goes to Bucket 2

This bucketing strategy is very helpful when you have big table to big table joins. Let's expand the example with Customers in one table, and Sales organized in another table. Both tables should be bucketed based on the join key, Customer Id, into an equal amount of buckets. Now when I try to join Customer 1 with all of their transactions, I know that Customer 1 will be in Bucket 1, for both the Customer and Sales tables. This means I will only need to join values from Bucket 1 for both of these tables in order to join the data, and I can skip trying to join the values from Bucket 1 to Bucket 2 for both tables.

Z ORDER-ing is another technique for 'Data Clustering' (where similar data elements are co-located together) but it provides additional functions on top of bucketing. Z ORDER indexes can be created using multiple columns, but the index efficiency drops off as you add more columns. Z ORDER-ing does not require you to specify the number of buckets that I want to bucket a column into, it just works automatically. The same benefits listed above for bucket to bucket joins apply for Z ORDER-ed indexes as well.

This Z ORDER index is also extracted and tracked inside of the Delta Log, and is used during query planning. So when querying the dataset based on a Z ORDER-ed column, the Z ORDER index can be used to efficiently seek directly to the right data elements, reducing query time.

A challenge of bucketing is that it is prone to creating small files. A 'bucket' in the Hive implementation is a subdirectory, and you will need a subdirectory for each bucket value inside of each partition . If I had 32 buckets, and a table partitioned by month, I could easily have 12 * 32 = 384 total directories, each with small files inside of them. Z ORDER-ing instead combines similar data elements into files, and attempts to right-size the files at the same time thanks to the OPTIMIZE command in Delta Lake. So instead of having tons of very small files broken up into 384 bucket folders, I have fewer files with more records inside of each file in the 12 folders, with the benefits of the Z ORDER index. These files are still created in their respective partitions though, so if you have a bad partitioning strategy you will still have small files.

There are further enhancements that Z ORDER-ing bring on top of some of the benefits listed above, but hopefully this is a good primer for comparing Z ORDERING and bucketing in Hive. 🙂

User16826994223
Honored Contributor III
Bucketing is physical partition of the the table but the Zordering is arrangement of records in a file , in most optimal manner
Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!