Databricks Community

Anonymous · ‎06-04-2021

User16783855117 · ‎06-07-2021

Bucketing is essentially running a single column through a hashing function to place them into a 'bucket'. So for example, I have a customer ID, and I have 2 buckets configured for the Customer ID column:

Customer 1 goes to Bucket 1

Customer 2 goes to Bucket 2

Customer 3 goes to Bucket 1

Customer 4 goes to Bucket 2

This bucketing strategy is very helpful when you have big table to big table joins. Let's expand the example with Customers in one table, and Sales organized in another table. Both tables should be bucketed based on the join key, Customer Id, into an equal amount of buckets. Now when I try to join Customer 1 with all of their transactions, I know that Customer 1 will be in Bucket 1, for both the Customer and Sales tables. This means I will only need to join values from Bucket 1 for both of these tables in order to join the data, and I can skip trying to join the values from Bucket 1 to Bucket 2 for both tables.

Z ORDER-ing is another technique for 'Data Clustering' (where similar data elements are co-located together) but it provides additional functions on top of bucketing. Z ORDER indexes can be created using multiple columns, but the index efficiency drops off as you add more columns. Z ORDER-ing does not require you to specify the number of buckets that I want to bucket a column into, it just works automatically. The same benefits listed above for bucket to bucket joins apply for Z ORDER-ed indexes as well.

This Z ORDER index is also extracted and tracked inside of the Delta Log, and is used during query planning. So when querying the dataset based on a Z ORDER-ed column, the Z ORDER index can be used to efficiently seek directly to the right data elements, reducing query time.

A challenge of bucketing is that it is prone to creating small files. A 'bucket' in the Hive implementation is a subdirectory, and you will need a subdirectory for each bucket value inside of each partition . If I had 32 buckets, and a table partitioned by month, I could easily have 12 * 32 = 384 total directories, each with small files inside of them. Z ORDER-ing instead combines similar data elements into files, and attempts to right-size the files at the same time thanks to the OPTIMIZE command in Delta Lake. So instead of having tons of very small files broken up into 384 bucket folders, I have fewer files with more records inside of each file in the 12 folders, with the benefits of the Z ORDER index. These files are still created in their respective partitions though, so if you have a bad partitioning strategy you will still have small files.

There are further enhancements that Z ORDER-ing bring on top of some of the benefits listed above, but hopefully this is a good primer for comparing Z ORDERING and bucketing in Hive. 🙂

View solution in original post

User16783855117 · ‎06-07-2021

Bucketing is essentially running a single column through a hashing function to place them into a 'bucket'. So for example, I have a customer ID, and I have 2 buckets configured for the Customer ID column:

Customer 1 goes to Bucket 1

Customer 2 goes to Bucket 2

Customer 3 goes to Bucket 1

Customer 4 goes to Bucket 2

This bucketing strategy is very helpful when you have big table to big table joins. Let's expand the example with Customers in one table, and Sales organized in another table. Both tables should be bucketed based on the join key, Customer Id, into an equal amount of buckets. Now when I try to join Customer 1 with all of their transactions, I know that Customer 1 will be in Bucket 1, for both the Customer and Sales tables. This means I will only need to join values from Bucket 1 for both of these tables in order to join the data, and I can skip trying to join the values from Bucket 1 to Bucket 2 for both tables.

Z ORDER-ing is another technique for 'Data Clustering' (where similar data elements are co-located together) but it provides additional functions on top of bucketing. Z ORDER indexes can be created using multiple columns, but the index efficiency drops off as you add more columns. Z ORDER-ing does not require you to specify the number of buckets that I want to bucket a column into, it just works automatically. The same benefits listed above for bucket to bucket joins apply for Z ORDER-ed indexes as well.

This Z ORDER index is also extracted and tracked inside of the Delta Log, and is used during query planning. So when querying the dataset based on a Z ORDER-ed column, the Z ORDER index can be used to efficiently seek directly to the right data elements, reducing query time.

A challenge of bucketing is that it is prone to creating small files. A 'bucket' in the Hive implementation is a subdirectory, and you will need a subdirectory for each bucket value inside of each partition . If I had 32 buckets, and a table partitioned by month, I could easily have 12 * 32 = 384 total directories, each with small files inside of them. Z ORDER-ing instead combines similar data elements into files, and attempts to right-size the files at the same time thanks to the OPTIMIZE command in Delta Lake. So instead of having tons of very small files broken up into 384 bucket folders, I have fewer files with more records inside of each file in the 12 folders, with the benefits of the Z ORDER index. These files are still created in their respective partitions though, so if you have a bad partitioning strategy you will still have small files.

There are further enhancements that Z ORDER-ing bring on top of some of the benefits listed above, but hopefully this is a good primer for comparing Z ORDERING and bucketing in Hive. 🙂