Databricks Community

subham0611 · ‎05-13-2024

Hi Databricks team,

I am trying to understand the internals of Spark's coalesce code (DefaultPartitionCoalescer) and am going through the Spark source for this. While I understand the coalesce function itself, I am not sure about the complete flow of the code, such as where it gets called and how the CoalescedRDD gets passed to the executors. If you can provide a sample flow, it would be great.

def coalesce(maxPartitions: Int, prev: RDD[_]): Array[PartitionGroup] = {
  val partitionLocs = new PartitionLocations(prev)
  // setup the groups (bins)
  setupGroups(math.min(prev.partitions.length, maxPartitions), partitionLocs)
  // assign partitions (balls) to each group (bins)
  throwBalls(maxPartitions, prev, balanceSlack, partitionLocs)
  getPartitions
}

I want to understand the code flow: which service internally calls this function, and how the coalesced partitions get distributed across executors.

raphaelblg · ‎05-21-2024

Hello @subham0611,

A coalesce operation triggered from user code can start from either an RDD or a Dataset, and each follows a distinct code path. Both the RDD and Dataset classes define their own coalesce function.
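As a rough illustration of the two entry points, here is a minimal sketch that calls coalesce once on an RDD and once on a Dataset. It assumes Spark is on the classpath and runs in local mode; the object name, app name, and partition counts are made up for the example. It only observes the result from user code and does not show Spark's internal call chain.

```scala
import org.apache.spark.NarrowDependency
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; the name and the numbers are illustrative only.
object CoalesceEntryPoints {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("coalesce-entry-points")
      .getOrCreate()
    try {
      // RDD path: RDD.coalesce with the default shuffle = false.
      val rdd = spark.sparkContext.parallelize(1 to 10000, numSlices = 1000)
      val coalescedRdd = rdd.coalesce(100)
      println(s"RDD partitions after coalesce: ${coalescedRdd.getNumPartitions}")
      // The coalesced RDD depends on its parent through a narrow dependency.
      println(coalescedRdd.dependencies.head.isInstanceOf[NarrowDependency[_]])

      // Dataset path: Dataset.coalesce adds a coalesce node to the query plan.
      val ds = spark.range(0L, 10000L, 1L, 1000)
      val coalescedDs = ds.coalesce(100)
      coalescedDs.explain() // prints the physical plan chosen by the driver
      println(s"Dataset partitions after coalesce: ${coalescedDs.rdd.getNumPartitions}")
    } finally {
      spark.stop()
    }
  }
}
```

In both cases the plan is built on the driver, and nothing executes until an action is called on the result.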

The coalescing logic is relatively straightforward. The driver node determines the Spark plan for the coalesce operation. When using the Dataset API, this operation results in a narrow dependency, so no shuffle takes place. For instance, if you reduce the number of partitions from 1000 to 100, each of the 100 new partitions will claim 10 of the current partitions.
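To make the 1000-to-100 example concrete, here is a minimal sketch of how parent partitions could be grouped by position. It has no Spark dependency, and it assumes the parent partitions report no preferred locations, so locality plays no role in the grouping. The names PositionalCoalesceSketch and groupParents are hypothetical. Treat it as a simplified model of the idea, not the actual DefaultPartitionCoalescer code.

```scala
// Simplified, Spark-free model of coalescing without locality information.
object PositionalCoalesceSketch {

  // For each new partition, return the range of parent partition indices it claims.
  // Hypothetical helper: contiguous parents are split as evenly as possible.
  def groupParents(numParents: Int, maxPartitions: Int): Vector[Range] = {
    require(numParents >= 0, "numParents must be non-negative")
    require(maxPartitions > 0, "maxPartitions must be positive")
    // Coalescing never produces more groups than there are parent partitions.
    val numGroups = math.min(numParents, maxPartitions)
    Vector.tabulate(numGroups) { i =>
      // Long arithmetic avoids Int overflow for large partition counts.
      val start = ((i.toLong * numParents) / numGroups).toInt
      val end = (((i.toLong + 1) * numParents) / numGroups).toInt
      start until end
    }
  }

  def main(args: Array[String]): Unit = {
    val groups = groupParents(numParents = 1000, maxPartitions = 100)
    assert(groups.length == 100)
    assert(groups.forall(_.size == 10)) // each new partition claims 10 parents
    assert(groups.flatten == (0 until 1000).toVector) // every parent is claimed exactly once
    println(s"${groups.length} groups of ${groups.head.size} parent partitions each")
  }
}
```

Because each new partition reads a fixed set of parent partitions, a task can compute it by iterating over those parents directly, which is why no shuffle is needed.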

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

