cancel
Showing results for 
Search instead for 
Did you mean: 
Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.
cancel
Showing results for 
Search instead for 
Did you mean: 

How does coalesce works internally

subham0611
New Contributor II

Hi Databricks team,

I am trying to understand internals of spark coalesce code(DefaultPartitionCoalescer) and going through spark code for this. While I understood coalesce function but I am not sure about complete flow of code like where its get called and how coalescedRDD gets passed to executor. If you can provide a sample flow it would be great.

 

def coalesce(maxPartitions: Int, prev: RDD[_]): Array[PartitionGroup] = {
val partitionLocs = new PartitionLocations(prev)
// setup the groups (bins)
setupGroups(math.min(prev.partitions.length, maxPartitions), partitionLocs)
// assign partitions (balls) to each group (bins)
throwBalls(maxPartitions, prev, balanceSlack, partitionLocs)
getPartitions
}

I wanted to understand the code flow. Which service internally calls this function and how coalesced partitions get distributed acorss executors etc.

1 ACCEPTED SOLUTION

Accepted Solutions

raphaelblg
Honored Contributor II

 

Hello @subham0611 ,

The coalesce operation triggered from user code can be initiated from either an RDD or a Dataset, with each having distinct codepaths:

Both the RDD and Dataset classes contain a coalesce function.

The coalescing logic is relatively straightforward:

The driver node determines the Spark plan for the coalesce operation. When using the Dataset API, this operation results in a narrow dependency. For instance, if you reduce the number of partitions from 1000 to 100, there will not be a shuffle. Instead, each of the 100 new partitions will claim 10 of the current partitions.

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

View solution in original post

1 REPLY 1

raphaelblg
Honored Contributor II

 

Hello @subham0611 ,

The coalesce operation triggered from user code can be initiated from either an RDD or a Dataset, with each having distinct codepaths:

Both the RDD and Dataset classes contain a coalesce function.

The coalescing logic is relatively straightforward:

The driver node determines the Spark plan for the coalesce operation. When using the Dataset API, this operation results in a narrow dependency. For instance, if you reduce the number of partitions from 1000 to 100, there will not be a shuffle. Instead, each of the 100 new partitions will claim 10 of the current partitions.

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group