topic How does coalesce works internally in Get Started Discussions

How does coalesce works internally

subham0611 — Mon, 13 May 2024 18:01:01 GMT

Hi Databricks team,

I am trying to understand internals of spark coalesce code(DefaultPartitionCoalescer) and going through spark code for this. While I understood coalesce function but I am not sure about complete flow of code like where its get called and how coalescedRDD gets passed to executor. If you can provide a sample flow it would be great.

def coalesce(maxPartitions: Int, prev: RDD[_]): Array[PartitionGroup] = {
  val partitionLocs = new PartitionLocations(prev)
  // setup the groups (bins)
  setupGroups(math.min(prev.partitions.length, maxPartitions), partitionLocs)
  // assign partitions (balls) to each group (bins)
  throwBalls(maxPartitions, prev, balanceSlack, partitionLocs)
  getPartitions
}

I wanted to understand the code flow. Which service internally calls this function and how coalesced partitions get distributed acorss executors etc.

Re: How does coalesce works internally

raphaelblg — Tue, 21 May 2024 17:58:30 GMT

Hello @subham0611 ,

The coalesce operation triggered from user code can be initiated from either an RDD or a Dataset, with each having distinct codepaths:

Both the RDD and Dataset classes contain a coalesce function.

The coalescing logic is relatively straightforward:

The driver node determines the Spark plan for the coalesce operation. When using the Dataset API, this operation results in a narrow dependency. For instance, if you reduce the number of partitions from 1000 to 100, there will not be a shuffle. Instead, each of the 100 new partitions will claim 10 of the current partitions.