Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Results from the Spark application to the driver

raghvendrarm1
New Contributor

I have read many articles but am still not clear on this:

The executors complete the execution of their tasks and hold the results.

1. Are the results (output data) from all executors always transported to the driver, or do the executors persist them directly when the output is destined for file storage etc.?

2. If the results are transported back to the driver in all cases, how is that achieved? Is there a link to a document describing this in detail?

2 REPLIES

dkushari
Databricks Employee

Hi @raghvendrarm1 - Have a look at the section "Apache Spark's Distributed Execution" in chapter 1 of Learning Spark, 2nd Edition (https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/ch01.html). Have a look at the picture -

[Attached image: dkushari_0-1760818756989.png]

K_Anudeep
Databricks Employee

Hello @raghvendrarm1,

Below are the answers to your questions:

Do executors always send "results" to the driver?

  • No. Only actions that return values (e.g., collect, take, first, count) bring data back to the driver. collect explicitly "returns all the records … as a list" in the driver's memory and is bounded by spark.driver.maxResultSize. In contrast, writes (df.write…, INSERT, SAVE) are performed by the executors directly against storage, with the driver just coordinating commits. Shuffles move data between executors, not through the driver.

How is it done under the hood (at a high level)?

  • Return-to-driver actions: each task serialises its partition's result; the driver gathers them (subject to spark.driver.maxResultSize). That's why a large collect() can OOM the driver.
  • Writes: Spark's Data Source API V2 has executors write partition outputs and send small commit messages back; the driver finalizes the job commit.
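If an action genuinely needs to pull a larger result back, the driver-side cap can be raised, e.g. in spark-defaults.conf or via --conf on spark-submit (a sketch; 4g is an arbitrary value, and 0 disables the limit entirely, which risks driver OOM):

```
# Cap on the total serialized size of task results the driver
# will accept for a single action (default is 1g)
spark.driver.maxResultSize  4g
```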