Re: How does hash shuffle join work in Spark?

VinayEmmadi · 01-25-2023

Hi All, I am trying to understand the internals of shuffle hash join, and I want to check if my understanding of it is correct. Let’s say I have two tables, t1 and t2, joined on column country (8 distinct values), and I set the number of shuffle partitions to 4 with two executors. In this case, data from t1 on both executors is first split into 4 partitions (let’s say part 0 to part 3), written as intermediate files (on disk or in memory), using hash(key) % 4, and the same is done with data from t2 across the two executors. In the reduce phase, data from the same partitions is merged, which finally results in 4 partitions (e.g., part 0 data from t1 and t2 from both executors is merged into one big part 0) before the join is performed. Is my understanding of it correct? Thanks for the help!

Anonymous · 03-08-2023

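@Vinay Emmadi : Your description of the shuffle phase is essentially correct. In Spark, a shuffle hash join is one of the strategies used to join two data sets on a common key. Both data sets are first hash-partitioned on the join key, and the partitions are shuffled so that rows with the same key from both sides end up in the same partition, handled by the same task. Within each partition, Spark builds an in-memory hash table from one side (normally the smaller one) and probes it with the rows of the other side. Unlike a sort-merge join, the shuffled data is not sorted.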
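To make the partitioning from the question concrete, here is a minimal plain-Python sketch (not Spark code) that routes 8 join keys to 4 shuffle partitions. The `partition_for` helper and the country values are made up for illustration, and `zlib.crc32` is only a deterministic stand-in for the Murmur3-based hash Spark uses internally, so the partition numbers Spark actually assigns would differ.

```python
import zlib
from collections import defaultdict

NUM_SHUFFLE_PARTITIONS = 4  # plays the role of spark.sql.shuffle.partitions


def partition_for(key: str, num_partitions: int = NUM_SHUFFLE_PARTITIONS) -> int:
    """Map a join key to a shuffle partition id (hypothetical helper).

    crc32 is a stand-in for Spark's internal hash; only the
    "hash(key) mod number of partitions" idea is the point here.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Eight distinct join keys, as in the question (values are illustrative).
countries = ["US", "IN", "DE", "FR", "BR", "JP", "CA", "AU"]

keys_by_partition = defaultdict(list)
for country in countries:
    keys_by_partition[partition_for(country)].append(country)

for pid in sorted(keys_by_partition):
    print(f"part {pid}: {keys_by_partition[pid]}")
```

With 8 keys and 4 partitions, at least one partition necessarily holds more than one key. What the scheme guarantees is the reverse: a given key always maps to the same partition id, no matter which executor or which table the row comes from.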

Here's a step-by-step explanation of how hash shuffle join works in Spark:

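Partitioning: The two data sets being joined are hash-partitioned on their join key, i.e. partition = hash(key) mod the number of shuffle partitions (spark.sql.shuffle.partitions). All records with the same join key land in the same partition, although a single partition can hold several different keys.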
Shuffle: Each map task writes its output grouped by target partition, and the reduce tasks fetch the blocks for their partition from all executors over the network. Each executor processes one or more partitions.
Local join: Each task joins its own partition of the two data sets. It builds a hash table from one side (the build side, normally the smaller one) and streams the other side through it, emitting a combined row for every pair of records with the same join key. No sorting is involved.
Merge: No separate merge step is needed after the local join. Because every join key lives in exactly one partition, the per-partition join outputs are simply the partitions of the final result.
Output: The joined partitions are passed on to the next stage of the query, written to storage, or returned to the application.
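The steps above can be sketched end to end in plain Python. This is a toy simulation under stated assumptions, not Spark's implementation: the helper names (`map_side`, `fetch`, `hash_join_partition`), the sample rows, and the use of `zlib.crc32` in place of Spark's internal hash are all made up for illustration.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # plays the role of spark.sql.shuffle.partitions


def partition_for(key):
    # Deterministic stand-in for Spark's hash partitioning (assumption).
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def map_side(rows):
    """One executor's map phase: bucket its local rows by target partition."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[partition_for(row[0])].append(row)
    return buckets


def fetch(map_outputs, pid):
    """Reduce-side fetch: gather partition `pid` from every map output."""
    return [row for out in map_outputs for row in out.get(pid, [])]


def hash_join_partition(build_rows, probe_rows):
    """Inner-join one partition: build a hash table, then probe it."""
    table = defaultdict(list)
    for key, value in build_rows:
        table[key].append(value)
    return [(key, pv, bv) for key, pv in probe_rows for bv in table.get(key, [])]


# Made-up data: each table is spread over two executors as (country, value) rows.
t1_by_executor = [[("US", "a1"), ("IN", "a2")], [("US", "a3"), ("DE", "a4")]]
t2_by_executor = [[("US", "b1"), ("FR", "b2")], [("IN", "b3")]]

t1_outputs = [map_side(rows) for rows in t1_by_executor]
t2_outputs = [map_side(rows) for rows in t2_by_executor]

result = []
for pid in range(NUM_PARTITIONS):
    t1_part = fetch(t1_outputs, pid)  # probe side
    t2_part = fetch(t2_outputs, pid)  # build side (the smaller table here)
    # Each partition is joined independently; outputs are just concatenated.
    result.extend(hash_join_partition(t2_part, t1_part))

print(sorted(result))
```

Only the keys present in both tables ("US" and "IN") produce output rows, and nothing is sorted before the join. The final `result.extend` is plain concatenation: since each key lives in exactly one partition, the per-partition outputs never need to be merged by key.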

Hope this helps you!