Hi @neerajaN,
You are right. Job 5 is a schema/header inference job. You can identify it as such because it triggers immediately upon spark.read: since header=True is set without a manual .schema(), Spark must launch a job to read the file headers before it can define the DataFrame. The subsequent Job 6 is the actual count() action that processes the full data.
To verify this for yourself, check the SQL tab in the Spark UI for Job 5. You'll see a FileScan operation that finishes almost instantly, because it only resolves column names rather than reading the full dataset.
To your question about who does the work: the driver coordinates, but the executors perform the actual read. Spark launches a small job in which executors read the first few rows of the files to determine the schema. Once the schema is known, your df.count() action triggers the distributed processing of the entire dataset.
It's the same concept I explained in response to one of your other posts.
To eliminate Job 5 and speed up your pipeline, provide a manual schema. This keeps Spark fully lazy: no job runs until count() is called.
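A minimal sketch of what that looks like. The column names and types here are hypothetical placeholders, as is the file path; adjust them to match your actual CSV:

```python
# Hypothetical schema written as a DDL string; replace with your file's columns.
manual_schema = "id INT, name STRING, amount DOUBLE"

# With an explicit schema, spark.read can define the DataFrame without
# launching any job at all (no "Job 5"):
#
# df = (spark.read
#         .schema(manual_schema)    # skips schema inference entirely
#         .option("header", True)   # header row is skipped, not scanned in a job
#         .csv("/path/to/data.csv"))
#
# df.count()  # now the only job launched is the count itself
```

You can also build the schema with StructType/StructField from pyspark.sql.types; the DDL string form is just more compact.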
References:
Note: Look for the section on samplingRatio and inferSchema which explains the performance trade-offs of schema discovery.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***