cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

How is Z-ORDER different from bucketing in Hive?

Anonymous
Not applicable
 
1 ACCEPTED SOLUTION

Accepted Solutions

User16783855117
Contributor II

Bucketing is essentially running a single column through a hashing function to place them into a 'bucket'. So for example, I have a customer ID, and I have 2 buckets configured for the Customer ID column:

Customer 1 goes to Bucket 1

Customer 2 goes to Bucket 2

Customer 3 goes to Bucket 1

Customer 4 goes to Bucket 2

This bucketing strategy is very helpful when you have big table to big table joins. Let's expand the example with Customers in one table, and Sales organized in another table. Both tables should be bucketed based on the join key, Customer Id, into an equal amount of buckets. Now when I try to join Customer 1 with all of their transactions, I know that Customer 1 will be in Bucket 1, for both the Customer and Sales tables. This means I will only need to join values from Bucket 1 for both of these tables in order to join the data, and I can skip trying to join the values from Bucket 1 to Bucket 2 for both tables.

Z ORDER-ing is another technique for 'Data Clustering' (where similar data elements are co-located together) but it provides additional functions on top of bucketing. Z ORDER indexes can be created using multiple columns, but the index efficiency drops off as you add more columns. Z ORDER-ing does not require you to specify the number of buckets that I want to bucket a column into, it just works automatically. The same benefits listed above for bucket to bucket joins apply for Z ORDER-ed indexes as well.

This Z ORDER index is also extracted and tracked inside of the Delta Log, and is used during query planning. So when querying the dataset based on a Z ORDER-ed column, the Z ORDER index can be used to efficiently seek directly to the right data elements, reducing query time.

A challenge of bucketing is that it is prone to creating small files. A 'bucket' in the Hive implementation is a subdirectory, and you will need a subdirectory for each bucket value inside of each partition . If I had 32 buckets, and a table partitioned by month, I could easily have 12 * 32 = 384 total directories, each with small files inside of them. Z ORDER-ing instead combines similar data elements into files, and attempts to right-size the files at the same time thanks to the OPTIMIZE command in Delta Lake. So instead of having tons of very small files broken up into 384 bucket folders, I have fewer files with more records inside of each file in the 12 folders, with the benefits of the Z ORDER index. These files are still created in their respective partitions though, so if you have a bad partitioning strategy you will still have small files.

There are further enhancements that Z ORDER-ing bring on top of some of the benefits listed above, but hopefully this is a good primer for comparing Z ORDERING and bucketing in Hive. ๐Ÿ™‚

View solution in original post

2 REPLIES 2

User16783855117
Contributor II

Bucketing is essentially running a single column through a hashing function to place them into a 'bucket'. So for example, I have a customer ID, and I have 2 buckets configured for the Customer ID column:

Customer 1 goes to Bucket 1

Customer 2 goes to Bucket 2

Customer 3 goes to Bucket 1

Customer 4 goes to Bucket 2

This bucketing strategy is very helpful when you have big table to big table joins. Let's expand the example with Customers in one table, and Sales organized in another table. Both tables should be bucketed based on the join key, Customer Id, into an equal amount of buckets. Now when I try to join Customer 1 with all of their transactions, I know that Customer 1 will be in Bucket 1, for both the Customer and Sales tables. This means I will only need to join values from Bucket 1 for both of these tables in order to join the data, and I can skip trying to join the values from Bucket 1 to Bucket 2 for both tables.

Z ORDER-ing is another technique for 'Data Clustering' (where similar data elements are co-located together) but it provides additional functions on top of bucketing. Z ORDER indexes can be created using multiple columns, but the index efficiency drops off as you add more columns. Z ORDER-ing does not require you to specify the number of buckets that I want to bucket a column into, it just works automatically. The same benefits listed above for bucket to bucket joins apply for Z ORDER-ed indexes as well.

This Z ORDER index is also extracted and tracked inside of the Delta Log, and is used during query planning. So when querying the dataset based on a Z ORDER-ed column, the Z ORDER index can be used to efficiently seek directly to the right data elements, reducing query time.

A challenge of bucketing is that it is prone to creating small files. A 'bucket' in the Hive implementation is a subdirectory, and you will need a subdirectory for each bucket value inside of each partition . If I had 32 buckets, and a table partitioned by month, I could easily have 12 * 32 = 384 total directories, each with small files inside of them. Z ORDER-ing instead combines similar data elements into files, and attempts to right-size the files at the same time thanks to the OPTIMIZE command in Delta Lake. So instead of having tons of very small files broken up into 384 bucket folders, I have fewer files with more records inside of each file in the 12 folders, with the benefits of the Z ORDER index. These files are still created in their respective partitions though, so if you have a bad partitioning strategy you will still have small files.

There are further enhancements that Z ORDER-ing bring on top of some of the benefits listed above, but hopefully this is a good primer for comparing Z ORDERING and bucketing in Hive. ๐Ÿ™‚

User16826994223
Honored Contributor III
Bucketing is physical partition of the the table but the Zordering is arrangement of records in a file , in most optimal manner
Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.