Advika
Community Manager

Hello @Rajt1!

When you execute `df = spark.read.json('xyz.json')`, Spark does not read the file immediately: the read is a lazy transformation, and the data is only processed when an action such as `count()` is triggered.

  • Job: df.count() is an action, so it triggers one job.
  • Stage: the file scan and the per-partition counting run together in a single stage; Spark then adds a small final stage that combines the partial counts (a single-partition exchange), so the UI typically shows two stages, with almost all of the work in the first.
  • Task: the number of tasks depends on how the input is partitioned, which in turn depends on the JSON mode (by default, Spark expects JSON files to be in single-line mode):
      – Single-line mode: the file is splittable, so Spark creates multiple tasks if the file is larger than the maximum partition size (spark.sql.files.maxPartitionBytes, 128 MB by default).
      – Multi-line mode: the entire file is treated as a single JSON document, making it non-splittable, so only 1 task reads it.