cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

How is Z-ORDER different from bucketing in Hive?

Anonymous
Not applicable
 
1 ACCEPTED SOLUTION

Accepted Solutions

User16783855117
Contributor II

Bucketing is essentially running a single column through a hashing function to place them into a 'bucket'. So for example, I have a customer ID, and I have 2 buckets configured for the Customer ID column:

Customer 1 goes to Bucket 1

Customer 2 goes to Bucket 2

Customer 3 goes to Bucket 1

Customer 4 goes to Bucket 2

This bucketing strategy is very helpful when you have big table to big table joins. Let's expand the example with Customers in one table, and Sales organized in another table. Both tables should be bucketed based on the join key, Customer Id, into an equal amount of buckets. Now when I try to join Customer 1 with all of their transactions, I know that Customer 1 will be in Bucket 1, for both the Customer and Sales tables. This means I will only need to join values from Bucket 1 for both of these tables in order to join the data, and I can skip trying to join the values from Bucket 1 to Bucket 2 for both tables.

Z ORDER-ing is another technique for 'Data Clustering' (where similar data elements are co-located together) but it provides additional functions on top of bucketing. Z ORDER indexes can be created using multiple columns, but the index efficiency drops off as you add more columns. Z ORDER-ing does not require you to specify the number of buckets that I want to bucket a column into, it just works automatically. The same benefits listed above for bucket to bucket joins apply for Z ORDER-ed indexes as well.

This Z ORDER index is also extracted and tracked inside of the Delta Log, and is used during query planning. So when querying the dataset based on a Z ORDER-ed column, the Z ORDER index can be used to efficiently seek directly to the right data elements, reducing query time.

A challenge of bucketing is that it is prone to creating small files. A 'bucket' in the Hive implementation is a subdirectory, and you will need a subdirectory for each bucket value inside of each partition . If I had 32 buckets, and a table partitioned by month, I could easily have 12 * 32 = 384 total directories, each with small files inside of them. Z ORDER-ing instead combines similar data elements into files, and attempts to right-size the files at the same time thanks to the OPTIMIZE command in Delta Lake. So instead of having tons of very small files broken up into 384 bucket folders, I have fewer files with more records inside of each file in the 12 folders, with the benefits of the Z ORDER index. These files are still created in their respective partitions though, so if you have a bad partitioning strategy you will still have small files.

There are further enhancements that Z ORDER-ing bring on top of some of the benefits listed above, but hopefully this is a good primer for comparing Z ORDERING and bucketing in Hive. ๐Ÿ™‚

View solution in original post

2 REPLIES 2

User16783855117
Contributor II

Bucketing is essentially running a single column through a hashing function to place them into a 'bucket'. So for example, I have a customer ID, and I have 2 buckets configured for the Customer ID column:

Customer 1 goes to Bucket 1

Customer 2 goes to Bucket 2

Customer 3 goes to Bucket 1

Customer 4 goes to Bucket 2

This bucketing strategy is very helpful when you have big table to big table joins. Let's expand the example with Customers in one table, and Sales organized in another table. Both tables should be bucketed based on the join key, Customer Id, into an equal amount of buckets. Now when I try to join Customer 1 with all of their transactions, I know that Customer 1 will be in Bucket 1, for both the Customer and Sales tables. This means I will only need to join values from Bucket 1 for both of these tables in order to join the data, and I can skip trying to join the values from Bucket 1 to Bucket 2 for both tables.

Z ORDER-ing is another technique for 'Data Clustering' (where similar data elements are co-located together) but it provides additional functions on top of bucketing. Z ORDER indexes can be created using multiple columns, but the index efficiency drops off as you add more columns. Z ORDER-ing does not require you to specify the number of buckets that I want to bucket a column into, it just works automatically. The same benefits listed above for bucket to bucket joins apply for Z ORDER-ed indexes as well.

This Z ORDER index is also extracted and tracked inside of the Delta Log, and is used during query planning. So when querying the dataset based on a Z ORDER-ed column, the Z ORDER index can be used to efficiently seek directly to the right data elements, reducing query time.

A challenge of bucketing is that it is prone to creating small files. A 'bucket' in the Hive implementation is a subdirectory, and you will need a subdirectory for each bucket value inside of each partition . If I had 32 buckets, and a table partitioned by month, I could easily have 12 * 32 = 384 total directories, each with small files inside of them. Z ORDER-ing instead combines similar data elements into files, and attempts to right-size the files at the same time thanks to the OPTIMIZE command in Delta Lake. So instead of having tons of very small files broken up into 384 bucket folders, I have fewer files with more records inside of each file in the 12 folders, with the benefits of the Z ORDER index. These files are still created in their respective partitions though, so if you have a bad partitioning strategy you will still have small files.

There are further enhancements that Z ORDER-ing bring on top of some of the benefits listed above, but hopefully this is a good primer for comparing Z ORDERING and bucketing in Hive. ๐Ÿ™‚

User16826994223
Honored Contributor III
Bucketing is physical partition of the the table but the Zordering is arrangement of records in a file , in most optimal manner

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group