Data Engineering

Why is spark creating 5 jobs and 200 tasks?

mordex
Visitor

I am trying to read 1,000 small CSV files, each about 30 KB in size, stored in a Databricks volume.

Below is the query I am running:

df = spark.read.option("header", "true").csv("/path")

df.collect()

 

Why is it creating 5 jobs? Why do jobs 1-3 have 200 tasks each, while job 4 has 1 task and job 5 has 32 tasks? Moreover, the total data size is 1000 × 30 KB = 30 MB, so why is it not creating a single partition?

Please help!

 


1 ACCEPTED SOLUTION

Accepted Solutions

Raman_Unifeye
Contributor III

 

In your case:

Jobs 1–3: Likely related to schema inference and metadata collection (each with ~200 tasks because Spark is scanning many small files in parallel).

Job 4: A trivial job (1 task) — often Spark creates a small stage for driver‑side operations.

Job 5: The actual data read and collect, with 32 tasks (default parallelism or based on cluster cores).

Why so many tasks for so little data: Spark creates one input partition per file split. With 1,000 small CSVs, Spark initially sees 1,000 splits, then groups them into tasks based on cluster parallelism and scheduling, which is why you see ~200 tasks in the early jobs. Even though the total size is only 30 MB, Spark does not merge the files into a single partition automatically; it treats each file as a separate input split.

To avoid this, combine the small files before ingestion, or use repartition/coalesce after reading, as sketched below.
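
A minimal sketch of both mitigations, assuming placeholder column names and the same '/path' from the question: supplying an explicit schema skips the inference scans, and coalesce(1) folds the ~30 MB of tiny splits into one partition.

from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema -- replace the column names and types with your real ones.
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
])

# An explicit schema avoids the extra schema-inference jobs; coalesce(1)
# merges the tiny per-file splits into a single partition after the read.
df = (spark.read
      .option("header", "true")
      .schema(schema)
      .csv("/path")
      .coalesce(1))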

 


RG #Driving Business Outcomes with Data Intelligence


3 REPLIES 3


mordex
Visitor

Hi Raman, thank you so much for such a detailed explanation. It answers almost every question.

I had one last doubt: I just created 10,000 files, and Spark still created only 200 tasks when listing leaf files. Is this a default number, or is there a config for it?

Raman_Unifeye
Contributor III

@mordex - yes, Spark caps the parallelism for file listing at 200 tasks, regardless of whether you have 1,000 or 10,000 files. It is controlled by spark.sql.sources.parallelPartitionDiscovery.parallelism.

Run the command below to get its value.
 
spark.conf.get('spark.sql.sources.parallelPartitionDiscovery.parallelism')
# returns '200' on this cluster
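
If you need a different listing parallelism, here is a sketch of adjusting it before the read, assuming this SQL conf is settable at session level on your runtime (the '/path' placeholder and the value 400 are illustrative, not recommendations):

# Assumption: the conf can be changed at runtime via spark.conf.set.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", "400")

# Re-run the read so the new listing parallelism takes effect.
df = spark.read.option("header", "true").csv("/path")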

RG #Driving Business Outcomes with Data Intelligence