In your case:
Jobs 1–3: Likely related to schema inference and metadata collection (each with ~200 tasks because Spark is scanning many small files in parallel).
Job 4: A trivial job (1 task) — often Spark creates a small stage for driver‑side operations.
Job 5: The actual data read and collect, with 32 tasks (default parallelism or based on cluster cores).
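A read along these lines would produce that job pattern; this is only a hedged reconstruction, and the path, options, and app setup below are assumptions:

```python
# Hedged reconstruction of the read that would produce this job pattern;
# the input path and options are assumptions, not the original code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")   # typically triggers the extra schema-inference jobs
         .csv("s3://bucket/path/to/small-csvs/")   # placeholder path
)

rows = df.collect()                       # the final read-and-collect job (~32 tasks here)
```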
Why so many tasks for so little data? Because Spark creates one input partition per file split. With 1000 small CSVs, Spark initially sees 1000 splits and then groups them into tasks based on cluster parallelism and scheduling, which is why you see ~200 tasks in the early jobs. Even though the total size is only 30 MB, Spark doesn't merge them into a single partition automatically; it treats each file as a separate input split.
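You can confirm this by checking how many partitions the DataFrame ended up with after the read (using the `df` from the sketch above):

```python
# Number of input partitions Spark created for the scan;
# with 1000 tiny files this reflects how the file splits were grouped.
print(df.rdd.getNumPartitions())
```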
To avoid this, combine the small files before ingestion, or use repartition/coalesce after reading (see the sketch below).
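A minimal sketch of that mitigation: supply an explicit schema so Spark skips the inference pass over all 1000 files, then coalesce the tiny partitions. The schema fields and input path here are placeholders, not the original job's values:

```python
# Sketch of the mitigation: explicit schema avoids the schema-inference jobs,
# coalesce collapses the many tiny input partitions.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), True),     # placeholder column
    StructField("value", StringType(), True),   # placeholder column
])

df = (
    spark.read
         .schema(schema)                  # no inference pass over the files
         .option("header", "true")
         .csv("s3://bucket/path/to/small-csvs/")   # placeholder path
         .coalesce(1)                     # ~30 MB total fits comfortably in one partition
)
```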