<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How does coalesce works internally in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/how-does-coalesce-works-internally/m-p/68917#M7273</link>
    <description>&lt;P&gt;Hi Databricks team,&lt;/P&gt;&lt;P&gt;I am trying to understand the internals of Spark's coalesce code (DefaultPartitionCoalescer) and am going through the Spark source for this. I understand the coalesce function itself, but I am not sure about the complete code flow, such as where it gets called and how the CoalescedRDD gets passed to the executors. A sample flow would be great.&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;&lt;SPAN&gt;def &lt;/SPAN&gt;&lt;SPAN&gt;coalesce&lt;/SPAN&gt;(maxPartitions: Int, prev: RDD[_]): Array[PartitionGroup] = {&lt;BR /&gt;  &lt;SPAN&gt;val &lt;/SPAN&gt;partitionLocs = &lt;SPAN&gt;new &lt;/SPAN&gt;PartitionLocations(prev)&lt;BR /&gt;  &lt;SPAN&gt;// setup the groups (bins)&lt;BR /&gt;&lt;/SPAN&gt;  setupGroups(math.&lt;SPAN&gt;min&lt;/SPAN&gt;(prev.partitions.length, maxPartitions), partitionLocs)&lt;BR /&gt;  &lt;SPAN&gt;// assign partitions (balls) to each group (bins)&lt;BR /&gt;&lt;/SPAN&gt;  throwBalls(maxPartitions, prev, balanceSlack, partitionLocs)&lt;BR /&gt;  getPartitions&lt;BR /&gt;}&lt;/PRE&gt;&lt;/DIV&gt;&lt;P&gt;I want to understand the code flow: which component internally calls this function, and how the coalesced partitions get distributed across executors.&lt;/P&gt;</description>
    <pubDate>Mon, 13 May 2024 18:01:01 GMT</pubDate>
    <dc:creator>subham0611</dc:creator>
    <dc:date>2024-05-13T18:01:01Z</dc:date>
    <item>
      <title>How does coalesce works internally</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-does-coalesce-works-internally/m-p/68917#M7273</link>
      <description>&lt;P&gt;Hi Databricks team,&lt;/P&gt;&lt;P&gt;I am trying to understand the internals of Spark's coalesce code (DefaultPartitionCoalescer) and am going through the Spark source for this. I understand the coalesce function itself, but I am not sure about the complete code flow, such as where it gets called and how the CoalescedRDD gets passed to the executors. A sample flow would be great.&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;&lt;SPAN&gt;def &lt;/SPAN&gt;&lt;SPAN&gt;coalesce&lt;/SPAN&gt;(maxPartitions: Int, prev: RDD[_]): Array[PartitionGroup] = {&lt;BR /&gt;  &lt;SPAN&gt;val &lt;/SPAN&gt;partitionLocs = &lt;SPAN&gt;new &lt;/SPAN&gt;PartitionLocations(prev)&lt;BR /&gt;  &lt;SPAN&gt;// setup the groups (bins)&lt;BR /&gt;&lt;/SPAN&gt;  setupGroups(math.&lt;SPAN&gt;min&lt;/SPAN&gt;(prev.partitions.length, maxPartitions), partitionLocs)&lt;BR /&gt;  &lt;SPAN&gt;// assign partitions (balls) to each group (bins)&lt;BR /&gt;&lt;/SPAN&gt;  throwBalls(maxPartitions, prev, balanceSlack, partitionLocs)&lt;BR /&gt;  getPartitions&lt;BR /&gt;}&lt;/PRE&gt;&lt;/DIV&gt;&lt;P&gt;I want to understand the code flow: which component internally calls this function, and how the coalesced partitions get distributed across executors.&lt;/P&gt;</description>
      <pubDate>Mon, 13 May 2024 18:01:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-does-coalesce-works-internally/m-p/68917#M7273</guid>
      <dc:creator>subham0611</dc:creator>
      <dc:date>2024-05-13T18:01:01Z</dc:date>
    </item>
    <item>
      <title>Re: How does coalesce works internally</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-does-coalesce-works-internally/m-p/70166#M7274</link>
      <description>
&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/92421"&gt;@subham0611&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;A coalesce triggered from user code can start from either an RDD or a Dataset, and each follows a distinct code path:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;RDD: &lt;A href="https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala" target="_blank" rel="noopener"&gt;https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Dataset: &lt;A href="https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala" target="_blank" rel="noopener"&gt;https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Both the RDD and Dataset classes contain a coalesce function.&lt;/P&gt;
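&lt;P&gt;As a minimal sketch of these two entry points, assuming Spark is on the classpath and a local master is acceptable: the object name, app name, and partition counts below are illustrative choices, not taken from this thread. It calls coalesce once on an RDD and once on a Dataset, then prints the resulting partition counts, the RDD lineage, and the Dataset's physical plan so the two code paths can be inspected.&lt;/P&gt;

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical driver program; names and sizes are illustrative only.
object CoalesceEntryPoints {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("coalesce-entry-points")
      .getOrCreate()

    // RDD code path: RDD.coalesce with the default shuffle = false.
    val rdd = spark.sparkContext.parallelize(1 to 10000, numSlices = 1000)
    val coalescedRdd = rdd.coalesce(100)
    println(s"RDD partitions after coalesce: ${coalescedRdd.getNumPartitions}")
    // Print the lineage so the RDD created by coalesce can be inspected.
    println(coalescedRdd.toDebugString)

    // Dataset code path: Dataset.coalesce adds a node to the query plan.
    val ds = spark.range(0L, 10000L, 1L, 1000)
    val coalescedDs = ds.coalesce(100)
    println(s"Dataset partitions after coalesce: ${coalescedDs.rdd.getNumPartitions}")
    // Print the physical plan for inspection.
    coalescedDs.explain()

    spark.stop()
  }
}
```

&lt;P&gt;In this sketch the RDD call leaves shuffle at its default of false, so no shuffle stage is introduced; passing shuffle = true to RDD.coalesce would instead add a shuffle step before the partitions are combined.&lt;/P&gt;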
&lt;P&gt;The coalescing logic is relatively straightforward:&lt;/P&gt;
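&lt;P&gt;The driver determines the plan for the coalesce operation, including which parent partitions are grouped into each new partition. On the RDD side, this grouping is computed when CoalescedRDD.getPartitions invokes the partition coalescer (DefaultPartitionCoalescer unless another one is supplied). When using the Dataset API, the operation results in a narrow dependency. For instance, if you reduce the number of partitions from 1000 to 100, there will not be a shuffle. Instead, each of the 100 new partitions will claim 10 of the current partitions, and the executor task for a new partition reads the parent partitions assigned to it.&lt;/P&gt;</description>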
      <pubDate>Tue, 21 May 2024 17:58:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-does-coalesce-works-internally/m-p/70166#M7274</guid>
      <dc:creator>raphaelblg</dc:creator>
      <dc:date>2024-05-21T17:58:30Z</dc:date>
    </item>
  </channel>
</rss>

