Hi @neerajaN,
You are right. Job 5 is a schema/header inference job. You can identify it as such because it triggers immediately upon spark.read: since header=True is set without a manual .schema(), Spark must launch a job to read the file headers before it can define the DataFrame. The subsequent Job 6 is the actual count() action that processes the full data.
To verify this for yourself, check the SQL tab in the Spark UI for Job 5. You'll see a FileScan operation that finishes almost instantly, because it only resolves column names rather than reading the full dataset.
To your question about who does the work: the driver coordinates, but the executors perform the actual read. Spark launches a small job in which executors read the first few rows of the files to determine the schema. Once the schema is known, your df.count() action triggers the distributed processing of the entire dataset.
It's the same concept I explained in response to one of your other posts.
To eliminate Job 5 and speed up your pipeline, provide a manual schema. This keeps Spark fully lazy: no job runs until count() is called.
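A minimal sketch of what that looks like. The column names and types here are hypothetical placeholders, as is the file path; adjust them to match your actual CSV:

```python
# Hypothetical schema written as a DDL string; replace with your file's columns.
manual_schema = "id INT, name STRING, amount DOUBLE"

# With an explicit schema, spark.read can define the DataFrame without
# launching any job at all (no "Job 5"):
#
# df = (spark.read
#         .schema(manual_schema)    # skips schema inference entirely
#         .option("header", True)   # header row is skipped, not scanned in a job
#         .csv("/path/to/data.csv"))
#
# df.count()  # now the only job launched is the count itself
```

You can also build the schema with StructType/StructField from pyspark.sql.types; the DDL string form is just more compact.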
References:
Note: Look for the section on samplingRatio and inferSchema which explains the performance trade-offs of schema discovery.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***