<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How does hash shuffle join work in Spark? (Data Engineering)</title>
    <link>https://community.databricks.com/t5/data-engineering/how-does-hash-shuffle-join-work-in-spark/m-p/10672#M5811</link>
    <description>&lt;P&gt;Hi All, I am trying to understand the internals of shuffle hash join, and I want to check whether my understanding is correct. Let’s say I have two tables t1 and t2 joined on column country (8 distinct values), and I set the number of shuffle partitions to 4 with two executors. In this case, data from t1 on both executors is first split into 4 partitions (say part 0 - part 3), stored as intermediate files on disk or in memory, using hash(key) % 4; the same is done with the t2 data on both executors. In the reduce phase, data belonging to the same partition number is merged, which finally results in 4 partitions (e.g., part 0 data from t1 and t2 on both executors is merged into one big part 0) before the join is performed. Is my understanding correct? Thanks for the help!&lt;/P&gt;</description>
    <pubDate>Wed, 25 Jan 2023 18:57:45 GMT</pubDate>
    <dc:creator>VinayEmmadi</dc:creator>
    <dc:date>2023-01-25T18:57:45Z</dc:date>
    <item>
      <title>How does hash shuffle join work in Spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-does-hash-shuffle-join-work-in-spark/m-p/10672#M5811</link>
      <description>&lt;P&gt;Hi All, I am trying to understand the internals of shuffle hash join, and I want to check whether my understanding is correct. Let’s say I have two tables t1 and t2 joined on column country (8 distinct values), and I set the number of shuffle partitions to 4 with two executors. In this case, data from t1 on both executors is first split into 4 partitions (say part 0 - part 3), stored as intermediate files on disk or in memory, using hash(key) % 4; the same is done with the t2 data on both executors. In the reduce phase, data belonging to the same partition number is merged, which finally results in 4 partitions (e.g., part 0 data from t1 and t2 on both executors is merged into one big part 0) before the join is performed. Is my understanding correct? Thanks for the help!&lt;/P&gt;</description>
      <pubDate>Wed, 25 Jan 2023 18:57:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-does-hash-shuffle-join-work-in-spark/m-p/10672#M5811</guid>
      <dc:creator>VinayEmmadi</dc:creator>
      <dc:date>2023-01-25T18:57:45Z</dc:date>
    </item>
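    <!-- The map-side split and reduce-side merge described in the question can be sketched in plain Python. This is illustrative only, not Spark's actual implementation: the table contents, partition count, and helper names (partition_of, map_side_split) are made-up examples, and CRC32 stands in for Spark's real key hash. -->

```python
# Sketch of the shuffle described above: rows on two "executors" are
# hash-partitioned into NUM_PARTITIONS buckets (map side), then buckets
# with the same number are merged across executors (reduce side).
import zlib

NUM_PARTITIONS = 4

def partition_of(key):
    # Deterministic stand-in for HashPartitioner's hash(key) % N
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def map_side_split(rows):
    """Split one executor's rows into NUM_PARTITIONS shuffle buckets."""
    buckets = {p: [] for p in range(NUM_PARTITIONS)}
    for key, value in rows:
        buckets[partition_of(key)].append((key, value))
    return buckets

# t1 rows spread across two executors (join key = country)
executor1_t1 = [("US", 1), ("IN", 2), ("DE", 3)]
executor2_t1 = [("US", 4), ("FR", 5), ("JP", 6)]

splits = [map_side_split(executor1_t1), map_side_split(executor2_t1)]

# Reduce phase: same-numbered buckets from every executor are merged, so
# all rows for a given key end up in exactly one of the 4 partitions.
merged = {p: [row for s in splits for row in s[p]]
          for p in range(NUM_PARTITIONS)}

for p, rows in merged.items():
    print(p, rows)
```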
    <item>
      <title>Re: How does hash shuffle join work in Spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-does-hash-shuffle-join-work-in-spark/m-p/10673#M5812</link>
      <description>&lt;P&gt;@Vinay Emmadi : In Spark, a shuffle hash join is used when joining two data sets on a common key. Both sides are first partitioned by the join key, and the partitions are shuffled across the cluster so that matching keys end up on the same node. Within each partition, Spark builds an in-memory hash table from the smaller side and probes it with the other side; no sorting is required, which is what distinguishes it from a sort-merge join.&lt;/P&gt;&lt;P&gt;Here's a step-by-step explanation of how a shuffle hash join works in Spark:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Partitioning: Both data sets are partitioned on their join key using the HashPartitioner, so all records with a given join key land in the same partition (a partition may hold several keys).&lt;/LI&gt;&lt;LI&gt;Shuffle: The partitions are redistributed across the nodes in the cluster over the network; each node receives one or more partitions.&lt;/LI&gt;&lt;LI&gt;Build: For each partition, the smaller side's rows are loaded into an in-memory hash table keyed on the join key.&lt;/LI&gt;&lt;LI&gt;Probe (local join): The other side's rows in the same partition are streamed through the hash table, and rows with matching join keys are combined into the joined output for that partition.&lt;/LI&gt;&lt;LI&gt;Output: Because each join key lives in exactly one partition, the per-partition results together form the final result, which is written to disk or returned to the application.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Hope this helps you!&lt;/P&gt;</description>
      <pubDate>Thu, 09 Mar 2023 04:17:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-does-hash-shuffle-join-work-in-spark/m-p/10673#M5812</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-03-09T04:17:39Z</dc:date>
    </item>
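    <!-- The per-partition build-and-probe step from the reply above can be sketched in plain Python. This is illustrative only, not Spark's API: the function name and partition contents are made-up examples, and it shows an inner join on a single partition. -->

```python
# Sketch of one shuffle-hash-join task: build an in-memory hash table
# from one side's partition, then probe it with the other side's rows.
def shuffle_hash_join_partition(build_rows, probe_rows):
    # Build phase: hash table keyed on the join key (smaller side)
    table = {}
    for key, value in build_rows:
        table.setdefault(key, []).append(value)
    # Probe phase: stream the other side, emit one row per match
    out = []
    for key, value in probe_rows:
        for match in table.get(key, []):
            out.append((key, match, value))
    return out

# Made-up contents of partition 0 from each table (key = country)
t2_part0 = [("US", "a"), ("US", "b"), ("IN", "c")]   # build side
t1_part0 = [("US", 10), ("FR", 20)]                  # probe side

result = shuffle_hash_join_partition(t2_part0, t1_part0)
print(result)  # inner-join matches for this partition only
```

    <!-- Each partition is joined independently; since every join key lives in exactly one partition, concatenating the per-partition results yields the full join with no extra merge step. -->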
  </channel>
</rss>

