I am running the code below:

df = spark.read.json('xyz.json')
df.count()

I want to understand how Spark actually works here. How many jobs and stages will be created? Please explain the concept in a simple, detailed way.
When you execute df = spark.read.json('xyz.json'), Spark does not read the file immediately. Data is only read when an action like count() is triggered.
Job: df.count() triggers one job because it's an action.
Stage: Reading JSON and counting don't require data shuffling, so Spark optimises them into a single stage.
Task: The number of tasks depends on how the data is partitioned, which in turn depends on the JSON mode (by default, Spark expects JSON files to be in single-line mode). In single-line mode, the file is splittable, so Spark creates multiple tasks if the file is larger than the default partition size. In multi-line mode, the entire file is treated as a single JSON value, making it non-splittable, so only 1 task reads it.