Hi Databricks team,
I am trying to understand internals of spark coalesce code(DefaultPartitionCoalescer) and going through spark code for this. While I understood coalesce function but I am not sure about complete flow of code like where its get called and how coalescedRDD gets passed to executor. If you can provide a sample flow it would be great.
def coalesce(maxPartitions: Int, prev: RDD[_]): Array[PartitionGroup] = {
val partitionLocs = new PartitionLocations(prev)
// setup the groups (bins)
setupGroups(math.min(prev.partitions.length, maxPartitions), partitionLocs)
// assign partitions (balls) to each group (bins)
throwBalls(maxPartitions, prev, balanceSlack, partitionLocs)
getPartitions
}
I wanted to understand the code flow. Which service internally calls this function and how coalesced partitions get distributed acorss executors etc.