I am running the code below:

df = spark.read.json('xyz.json')
df.count()

I want to understand how Spark actually works here. How many jobs and stages will be created? Please explain the concept in a simple, detailed way.
When you execute df = spark.read.json('xyz.json'), Spark does not read the file immediately. Data is only read when an action like count() is triggered.
Job: df.count() triggers one job because it's an action.
Stage: Reading JSON and counting don't require data shuffling, so Spark optimises them into a single stage.
Task: The number of tasks depends on how the data is partitioned, which in turn depends on the JSON mode (by default, Spark expects JSON files to be in single-line mode). In single-line mode, the file is splittable, so Spark creates multiple tasks if the file is larger than the default partition size. In multi-line mode, the entire file is treated as a single JSON value, making it non-splittable, so only 1 task reads it.