Job, Task, Stage Creation
Wednesday
I am running the code below:
df = spark.read.json('xyz.json')
df.count()
I want to understand how Spark actually executes this. How many jobs and stages will be created? I'd like a detailed but easy-to-follow explanation of how it works.
1 REPLY
Thursday - last edited Thursday
Hello @Rajt1!
When you execute df = spark.read.json('xyz.json'), Spark does not process the data immediately: reading is lazy, and the file is only fully scanned when an action like count() is triggered. One caveat: because JSON files carry no embedded schema, spark.read.json itself runs a small job to infer the schema by scanning the data, unless you pass an explicit schema.
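Here is a minimal sketch of the lazy-evaluation behaviour (the file name xyz.json comes from your snippet; the schema field is hypothetical and would need to match your actual data). Supplying an explicit schema skips the inference scan, so the read becomes purely lazy and count() triggers the only job:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema for illustration; adjust the fields to match xyz.json.
schema = StructType([StructField("id", StringType(), True)])

# Transformation only: with an explicit schema there is no inference scan,
# so this line just builds a logical plan and reads nothing.
df = spark.read.schema(schema).json('xyz.json')

# Action: this triggers the job that actually reads the file and counts rows.
print(df.count())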
- Job: df.count() is an action, so it triggers one job (plus the schema-inference job from the read, if no schema was supplied).
- Stage: There is no wide shuffle here; Spark counts rows within each partition and then merges the partial counts, so the Spark UI typically shows a scan-and-partial-count stage followed by a tiny single-task final aggregation stage.
- Task: The number of tasks depends on how the input is partitioned, which here comes down to the JSON mode (by default, Spark expects JSON files to be in single-line mode):
  - Single-line mode: the file is splittable, so Spark creates multiple tasks if it is larger than the default partition size (roughly 128 MB).
  - Multi-line mode: the entire file is parsed as one unit, making it non-splittable, so only 1 task runs. You can check this yourself, as in the sketch below.
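To see the task count for yourself, you can inspect how many partitions the scan produces in each mode. This is a sketch assuming xyz.json is available; getNumPartitions() reports the number of input partitions, which equals the number of tasks in the scan stage:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default single-line mode: each line is one JSON record, so the file is
# splittable; expect multiple partitions (tasks) for files beyond ~128 MB.
single_line_df = spark.read.json('xyz.json')
print(single_line_df.rdd.getNumPartitions())

# Multi-line mode: a file may hold one JSON document spanning many lines,
# so it is non-splittable and the whole file is read by a single task.
multi_line_df = spark.read.option('multiLine', True).json('xyz.json')
print(multi_line_df.rdd.getNumPartitions())  # expect 1 for a single file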

