I have a Delta table `spark101.airlines` (sourced from `/databricks-datasets/airlines/`) partitioned by `Year`. My `spark.sql.shuffle.partitions` is set to the default of 200. I run a simple query:
```sql
select Origin, count(*)
from spark101.airlines
group by Origin
```
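For context, here is a minimal sketch of how I look at the relevant settings and the physical plan (PySpark in a Databricks notebook, with the `spark` session already available; the exact variable names are just for illustration):

```python
# Check the two configs that drive partition counts in this query.
print(spark.conf.get("spark.sql.shuffle.partitions"))       # 200 (the default)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))  # default is 128 MB (134217728 bytes)

df = spark.sql("""
    select Origin, count(*) as cnt
    from spark101.airlines
    group by Origin
""")

# The physical plan shows a partial HashAggregate, an Exchange (shuffle),
# and then a final HashAggregate.
df.explain()
```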
Stage 1: The data is read into 17 partitions, which is consistent with my `spark.sql.files.maxPartitionBytes` setting. This stage also pre-aggregates the data within each task (map-side partial aggregation) and writes the shuffle output into 200 partitions.
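A rough sketch of how I double-check the number of input splits behind that first stage (same `spark` session; the 17 is simply what I observe on my cluster, not a guaranteed value):

```python
# Number of partitions the table scan produces, driven by file sizes
# and spark.sql.files.maxPartitionBytes.
scan_partitions = spark.table("spark101.airlines").rdd.getNumPartitions()
print(scan_partitions)  # 17 on my cluster
```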
What I would expect:
Stage 2: It should spawn 200 tasks that read and aggregate the shuffle partitions from the previous stage.
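In other words, I would expect the aggregated result to end up in 200 partitions, matching `spark.sql.shuffle.partitions`. A sketch of how I'd verify that (assuming nothing coalesces the shuffle partitions at runtime):

```python
# Expectation: the number of post-shuffle partitions equals spark.sql.shuffle.partitions.
agg = spark.sql("select Origin, count(*) as cnt from spark101.airlines group by Origin")
print(agg.rdd.getNumPartitions())  # expected: 200
```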
What I've got instead:
All the other stages add up to 200 tasks, but why are there separate jobs spawned?