<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Job, Task, Stage Creation in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/job-task-stage-creation/m-p/113749#M44621</link>
    <description>&lt;P&gt;I am running the code below:&lt;BR /&gt;df = spark.read.json('xyz.json')&lt;BR /&gt;df.count()&lt;BR /&gt;&lt;BR /&gt;I want to understand how Spark actually executes this: how many jobs &amp;amp; stages will be created? I would like a detailed but simple explanation of how it works.&lt;/P&gt;</description>
    <pubDate>Wed, 26 Mar 2025 19:01:32 GMT</pubDate>
    <dc:creator>Rajt1</dc:creator>
    <dc:date>2025-03-26T19:01:32Z</dc:date>
    <item>
      <title>Job, Task, Stage Creation</title>
      <link>https://community.databricks.com/t5/data-engineering/job-task-stage-creation/m-p/113749#M44621</link>
      <description>&lt;P&gt;I am running the code below:&lt;BR /&gt;df = spark.read.json('xyz.json')&lt;BR /&gt;df.count()&lt;BR /&gt;&lt;BR /&gt;I want to understand how Spark actually executes this: how many jobs &amp;amp; stages will be created? I would like a detailed but simple explanation of how it works.&lt;/P&gt;</description>
      <pubDate>Wed, 26 Mar 2025 19:01:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/job-task-stage-creation/m-p/113749#M44621</guid>
      <dc:creator>Rajt1</dc:creator>
      <dc:date>2025-03-26T19:01:32Z</dc:date>
    </item>
    <item>
      <title>Re: Job, Task, Stage Creation</title>
      <link>https://community.databricks.com/t5/data-engineering/job-task-stage-creation/m-p/113794#M44638</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/155369"&gt;@Rajt1&lt;/a&gt;!&lt;/P&gt;
&lt;P&gt;When you execute &lt;EM&gt;df = spark.read.json('xyz.json')&lt;/EM&gt;, Spark does not process the data immediately; reads and transformations are lazy, and the data is only processed when an action such as &lt;EM&gt;count()&lt;/EM&gt; is triggered. (One caveat: because no schema is supplied here, &lt;EM&gt;spark.read.json&lt;/EM&gt; does run a small job up front to scan the file and infer the schema.)&lt;/P&gt;
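&lt;P&gt;A minimal sketch of this laziness, reusing the file name from your snippet (&lt;EM&gt;explain()&lt;/EM&gt; only prints the query plan, without launching the counting job):&lt;/P&gt;
&lt;PRE&gt;df = spark.read.json('xyz.json')  # lazy: builds a plan (plus a small schema-inference job)
df.explain()                      # inspect the physical plan; still no counting job in the Spark UI
df.count()                        # action: Spark now submits the job that counts the rows&lt;/PRE&gt;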
&lt;UL&gt;
&lt;LI&gt;Job:&amp;nbsp;&lt;EM&gt;df.count()&lt;/EM&gt; triggers one job because it's an action.&lt;/LI&gt;
&lt;LI&gt;Stage:&amp;nbsp;Reading the JSON and counting rows do not require a data shuffle, so Spark plans the scan and the per-partition counts as a single stage. (In the Spark UI you will typically also see a second, tiny stage with one task that combines the partial counts into the final result.)&lt;/LI&gt;
&lt;LI&gt;Task: The number of tasks depends on how the input is partitioned. For JSON, this depends on the read mode (by default, Spark expects JSON files to be in single-line mode): in single-line mode the file is splittable, so Spark creates multiple tasks when the file is larger than the target partition size; in multi-line mode the entire file is treated as a single JSON value, making it non-splittable, so only one task runs.&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 27 Mar 2025 11:55:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/job-task-stage-creation/m-p/113794#M44638</guid>
      <dc:creator>Advika</dc:creator>
      <dc:date>2025-03-27T11:55:26Z</dc:date>
    </item>
  </channel>
</rss>

