How does hash shuffle join work in Spark?
01-25-2023 10:57 AM
Hi all, I am trying to understand the internals of shuffle hash join and want to check whether my understanding is correct. Let's say I have two tables, t1 and t2, joined on column country (8 distinct values), and I set the number of shuffle partitions to 4 with two executors. In this case, the data from t1 on both executors is first split into 4 partitions (say part 0 to part 3), written as intermediate files (on disk or in memory), using hash(key) % 4, and the same is done with the data from t2 on both executors. In the reduce phase, data from the same partition number is brought together, which results in 4 partitions (e.g. part 0 data from t1 and t2 from both executors is merged into one big part 0) before the join is performed. Is my understanding correct? Thanks for the help!
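To make the scenario concrete, here is a minimal pure-Python sketch of the map-side split and reduce-side fetch described above. It is only a simulation, not Spark code: the row data, the function names, and the use of `zlib.crc32` as the hash are all assumptions for illustration (Spark SQL uses its own Murmur3-based hash, so the actual partition numbers will differ).

```python
import zlib
from collections import defaultdict

NUM_SHUFFLE_PARTITIONS = 4  # stands in for spark.sql.shuffle.partitions


def partition_id(key, num_partitions=NUM_SHUFFLE_PARTITIONS):
    # Deterministic stand-in hash (assumption); Spark SQL uses a Murmur3-based hash.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def map_side_write(rows, num_partitions=NUM_SHUFFLE_PARTITIONS):
    """Split one executor's rows into per-partition buckets (the shuffle write)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[partition_id(row[0], num_partitions)].append(row)
    return dict(buckets)


def reduce_side_read(shuffle_files, pid):
    """Collect the bucket for partition `pid` from every map output."""
    rows = []
    for buckets in shuffle_files:
        rows.extend(buckets.get(pid, []))
    return rows


# Hypothetical t1 rows (country, value) held by two executors.
t1_executor_a = [("IN", 1), ("US", 2), ("FR", 3), ("DE", 4)]
t1_executor_b = [("IN", 5), ("JP", 6), ("BR", 7), ("US", 8)]

# Each executor writes its own set of up to 4 buckets.
shuffle_files = [map_side_write(t1_executor_a), map_side_write(t1_executor_b)]

# Each reduce-side partition pulls its bucket from both executors.
for pid in range(NUM_SHUFFLE_PARTITIONS):
    print(pid, reduce_side_read(shuffle_files, pid))
```

Rows for the same country always end up in the same partition number, whichever executor they started on. The same split would be applied to t2.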
03-08-2023 08:17 PM
@Vinay Emmadi : Your description of the partitioning and shuffle is essentially right. In Spark, a shuffle hash join is a join strategy for two data sets that share a join key. Both data sets are hash-partitioned on the join key and shuffled so that matching partitions from each side end up in the same task. Within each partition, Spark builds an in-memory hash table from the smaller side and probes it with rows from the other side. Unlike a sort-merge join, the shuffled data is not sorted.
Here's a step-by-step explanation of how a shuffle hash join works in Spark:
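- Partitioning: Both data sets are hash-partitioned on the join key: each row is assigned to partition hash(key) mod N, where N is the number of shuffle partitions (spark.sql.shuffle.partitions). All rows with the same join key get the same partition number on both sides, but a single partition can hold several distinct keys.
- Shuffle: Each map task writes its rows to local shuffle files, grouped by target partition. Each reduce task then fetches the blocks for its partition from every map task's output, over the network where needed. An executor can process one or more partitions.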
- Local join: Each reduce task joins its partition of one data set with the matching partition of the other. It builds an in-memory hash table from the smaller (build) side, then streams the rows of the other side through it and emits a combined row for every pair with equal join keys.
- Merge: No separate merge step is needed after the local join. Because all rows for a given key are in the same partition, each task's output is already complete for its keys, and the join result is simply the collection of these output partitions.
- Output: The joined partitions are passed to the next operator in the query plan, written to storage, or returned to the application, depending on the query.
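The steps above can be sketched in plain Python. This is a simplified simulation under stated assumptions, not Spark's implementation: the table contents, the function names, and the `zlib.crc32` hash are hypothetical stand-ins, and everything runs in one process rather than across executors.

```python
import zlib
from collections import defaultdict


def partition_id(key, num_partitions):
    # Stand-in hash (assumption); Spark SQL uses a Murmur3-based hash.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def shuffle(rows, num_partitions):
    """Hash-partition (key, value) rows into a list of partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[partition_id(key, num_partitions)].append((key, value))
    return partitions


def join_partition(build_rows, stream_rows):
    """Inner-join one co-located pair of partitions with build and probe."""
    # Build phase: hash table over the (smaller) build side.
    hash_table = defaultdict(list)
    for key, value in build_rows:
        hash_table[key].append(value)
    # Probe phase: stream the other side and look up each key.
    for key, value in stream_rows:
        for build_value in hash_table.get(key, []):
            yield (key, value, build_value)


def shuffle_hash_join(stream_side, build_side, num_partitions=4):
    stream_parts = shuffle(stream_side, num_partitions)
    build_parts = shuffle(build_side, num_partitions)
    # Partition i of one side is joined only with partition i of the other.
    return [list(join_partition(b, s)) for s, b in zip(stream_parts, build_parts)]


# Hypothetical tables keyed by country.
t1 = [("IN", "order-1"), ("US", "order-2"), ("IN", "order-3"), ("FR", "order-4")]
t2 = [("IN", "India"), ("US", "United States"), ("JP", "Japan")]

result_partitions = shuffle_hash_join(t1, t2)
joined = sorted(row for part in result_partitions for row in part)
print(joined)
```

Each output partition is produced independently, which is why no merge step follows the local join. Note that Spark often prefers a sort-merge join by default, so whether a shuffle hash join is actually chosen depends on the query, the configuration, and any join hints.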
Hope this helps you!