I have a Delta table `spark101.airlines` (sourced from `/databricks-datasets/airlines/`) partitioned by `Year`. My `spark.sql.shuffle.partitions` is set to the default of 200. I run a simple query:
```sql
select Origin, count(*)
from spark101.airlines
group by Origin
```
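For context, here is a minimal sketch of how I look at the relevant settings and the physical plan (PySpark in a Databricks notebook, with the `spark` session already available; the exact variable names are just for illustration):

```python
# Check the two configs that drive partition counts in this query.
print(spark.conf.get("spark.sql.shuffle.partitions"))       # 200 (the default)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))  # default is 128 MB (134217728 bytes)

df = spark.sql("""
    select Origin, count(*) as cnt
    from spark101.airlines
    group by Origin
""")

# The physical plan shows a partial HashAggregate, an Exchange (shuffle),
# and then a final HashAggregate.
df.explain()
```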
Stage 1: The data is read into 17 partitions, which is consistent with my `spark.sql.files.maxPartitionBytes` setting. This stage also pre-aggregates the data within each task (map-side partial aggregation) and writes the shuffle output into 200 partitions.
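A rough sketch of how I double-check the number of input splits behind that first stage (same `spark` session; the 17 is simply what I observe on my cluster, not a guaranteed value):

```python
# Number of partitions the table scan produces, driven by file sizes
# and spark.sql.files.maxPartitionBytes.
scan_partitions = spark.table("spark101.airlines").rdd.getNumPartitions()
print(scan_partitions)  # 17 on my cluster
```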
What I would expect:
Stage 2: It should spawn 200 tasks that read and aggregate the shuffle partitions from the previous stage.
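In other words, I would expect the aggregated result to end up in 200 partitions, matching `spark.sql.shuffle.partitions`. A sketch of how I'd verify that (assuming nothing coalesces the shuffle partitions at runtime):

```python
# Expectation: the number of post-shuffle partitions equals spark.sql.shuffle.partitions.
agg = spark.sql("select Origin, count(*) as cnt from spark101.airlines group by Origin")
print(agg.rdd.getNumPartitions())  # expected: 200
```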
What I've got instead:
All the other stages add up to 200 tasks, but why are there separate jobs spawned?