Get Started Discussions

How does coalesce work internally?

subham0611
New Contributor II

Hi Databricks team,

I am trying to understand the internals of Spark's coalesce code (DefaultPartitionCoalescer) and am going through the Spark source for this. I understand the coalesce function itself, but I am not sure about the complete flow of the code, i.e. where it gets called from and how the CoalescedRDD gets passed to the executors. If you can provide a sample flow, that would be great.

 

def coalesce(maxPartitions: Int, prev: RDD[_]): Array[PartitionGroup] = {
  val partitionLocs = new PartitionLocations(prev)
  // setup the groups (bins)
  setupGroups(math.min(prev.partitions.length, maxPartitions), partitionLocs)
  // assign partitions (balls) to each group (bins)
  throwBalls(maxPartitions, prev, balanceSlack, partitionLocs)
  getPartitions
}

I wanted to understand the code flow: which component internally calls this function, and how the coalesced partitions get distributed across executors.

1 ACCEPTED SOLUTION


raphaelblg
Databricks Employee

 

Hello @subham0611 ,

A coalesce triggered from user code can be initiated on either an RDD or a Dataset, and each has a distinct code path:

Both the RDD and Dataset classes contain a coalesce function.
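To make the two entry points concrete, here is a minimal sketch (assuming a local SparkSession; the comments summarize what each API does internally):

```scala
import org.apache.spark.sql.SparkSession

object CoalesceEntryPoints {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("coalesce-entry-points")
      .getOrCreate()

    // RDD API: RDD.coalesce constructs a CoalescedRDD. Its getPartitions
    // delegates to a PartitionCoalescer (DefaultPartitionCoalescer by
    // default), which runs the setupGroups/throwBalls logic on the driver.
    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 10)
    println(rdd.coalesce(2).partitions.length) // 2

    // Dataset API: Dataset.coalesce does not call the RDD method directly;
    // it adds a Repartition(shuffle = false) node to the logical plan,
    // which the planner later turns into a coalesce on the underlying RDD.
    val ds = spark.range(1000).repartition(10)
    println(ds.coalesce(2).rdd.partitions.length) // 2

    spark.stop()
  }
}
```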

The coalescing logic is relatively straightforward:

The driver node determines the Spark plan for the coalesce operation. When using the Dataset API, this operation results in a narrow dependency: for instance, if you reduce the number of partitions from 1000 to 100, there will be no shuffle. Instead, each of the 100 new partitions will claim 10 of the current partitions.
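You can verify the narrow dependency yourself. The sketch below (again assuming a local SparkSession) coalesces 1000 partitions to 100 and inspects the resulting dependency; `getParents(0)` lists the parent partition indices that the first coalesced partition claims:

```scala
import org.apache.spark.NarrowDependency
import org.apache.spark.sql.SparkSession

object NarrowCoalesceDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("narrow-coalesce")
      .getOrCreate()
    val sc = spark.sparkContext

    val parent = sc.parallelize(1 to 100000, numSlices = 1000)
    val coalesced = parent.coalesce(100) // no shuffle argument => no shuffle

    println(coalesced.partitions.length) // 100

    // CoalescedRDD exposes a single NarrowDependency: each output partition
    // reads a fixed set of parent partitions, so no shuffle files are
    // written and tasks simply iterate over their claimed parents.
    val dep = coalesced.dependencies.head
    println(dep.isInstanceOf[NarrowDependency[_]]) // true
    println(dep.asInstanceOf[NarrowDependency[_]].getParents(0)) // ~10 parent ids

    spark.stop()
  }
}
```

Because the dependency is narrow, the scheduler pipelines the coalesce into the same stage as the upstream computation; each executor task just reads the blocks of its claimed parent partitions.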

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks


