yesterday
I am trying to read 1,000 small CSV files, each about 30 KB in size, stored in a Databricks volume.
Below is the query I am running:
df = spark.read.options(header=True).csv('/path')
df.collect()
Why is it creating 5 jobs? Why do jobs 1-3 have 200 tasks each, job 4 have 1 task, and job 5 have 32 tasks? Moreover, the total data size is 1000 * 30 KB = 30 MB, so why is it not creating a single partition?
Please help.
yesterday
In your case:
Jobs 1–3: Likely related to schema inference and metadata collection (each with ~200 tasks because Spark is scanning many small files in parallel).
Job 4: A trivial job (1 task) — often Spark creates a small stage for driver‑side operations.
Job 5: The actual data read and collect, with 32 tasks (default parallelism or based on cluster cores).
Why so many tasks for this small dataset: Spark creates one input partition per file split. With 1,000 small CSVs, Spark initially sees 1,000 splits, then groups them into tasks based on cluster parallelism and scheduling. That's why you see ~200 tasks in the early jobs. Even though the total size is only 30 MB, Spark doesn't merge them into a single partition automatically; it treats each file as a separate input split.
To avoid this, combine the small files before ingestion or use repartition/coalesce, as in the sketch below.
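A minimal sketch (PySpark, assuming the Databricks-provided spark session): supplying an explicit schema skips the schema-inference jobs, and coalescing afterwards collapses the many tiny splits into one partition. The column names below are hypothetical placeholders, and '/path' is the same placeholder as in your question.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema for illustration; replace with your real columns.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("value", StringType(), True),
])

df = (spark.read
      .option("header", True)
      .schema(schema)          # no inference -> fewer upfront jobs
      .csv("/path"))

# 30 MB fits easily in one partition; coalesce avoids a full shuffle.
df = df.coalesce(1)
print(df.rdd.getNumPartitions())  # expected: 1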
yesterday
Hi Raman, thank you so much for such a detailed explanation. It answers almost every question.
One last doubt: I just created 10,000 files and Spark still created only 200 tasks when listing the leaf files. Is this the default number, or is there a config for it?
yesterday
@mordex - yes, Spark caps the parallelism for file listing at 200 tasks, regardless of whether you have 1,000 or 10,000 files. It is controlled by spark.sql.sources.parallelPartitionDiscovery.parallelism.
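If you want to experiment with the listing parallelism, something like this should work (a minimal sketch; the value 400 is just an arbitrary example, and the threshold line shows the related setting for driver-side listing):

# Raise the parallelism used when listing files/partitions in parallel.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", "400")

# Related setting: below this many paths, listing happens on the driver only.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")

df = spark.read.option("header", True).csv("/path")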
yesterday
Thank you again for the kind feedback! I truly appreciate it.