Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

count function

neerajaN
New Contributor II

Hi, as per Spark internals, once the count function is executed on the worker nodes, does one of the worker nodes collect all the record counts and perform the summation?

Or are the record counts from all worker nodes passed to the driver node, with the summation done on the driver side? Could someone please clarify?

2 ACCEPTED SOLUTIONS

Accepted Solutions

balajij8
Contributor
  • Worker nodes (Executors) - Each executor computes a partial count of the records in its local partitions.
  • Driver node - All partial counts are sent to the driver, which performs the final summation and returns the result. count is an action; Spark keeps the aggregation distributed until this final step.

View solution in original post

Ashwin_DSA
Databricks Employee

Hi @neerajaN,

When you run a count() action in Spark, the summation happens in a two-step process to ensure efficiency and prevent network bottlenecks:

  1. Local Aggregation (Executor Level): Each Executor (worker node) calculates a partial count for the data partitions it holds. It doesn't send individual records; it just computes a single integer representing the count of its local data.

  2. Final Summation (Driver Level): The Executors send these small "partial count" integers to the Driver Node. The Driver then performs the final summation of these partial results to return the total count to the user.

In short, the Driver node is responsible for the final aggregation of the partial totals, but the heavy lifting of counting millions of rows is done in parallel across the workers.
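The two-step flow can be sketched in plain Python (an illustrative simulation, not actual Spark code; the partition contents below are made up for the example):

```python
# Illustrative simulation of Spark's two-phase count().
# Each "executor" counts its own partitions locally; the "driver"
# only ever receives one small integer per partition.

# Hypothetical data split into partitions across executors.
partitions = [
    ["a", "b", "c"],       # partition 0
    ["d", "e"],            # partition 1
    ["f", "g", "h", "i"],  # partition 2
]

# Phase 1 (executors): compute a partial count per partition.
partial_counts = [len(p) for p in partitions]  # [3, 2, 4]

# Phase 2 (driver): sum the small partial integers.
total = sum(partial_counts)
print(total)  # 9
```

Note that only three integers cross the network here, regardless of how many rows each partition holds, which is why count() stays cheap on the driver side.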

Reference links as requested:

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

View solution in original post

4 REPLIES 4


neerajaN
New Contributor II

Thank you for the explanation. Please share the relevant documentation for this.


SteveOstrowski
Databricks Employee

Hi @neerajaN,

You are correct that the count() operation follows a two-phase aggregation pattern in Spark. Here is how it works in detail:

PHASE 1: PARTIAL AGGREGATION (EXECUTORS)

Each executor computes a local partial count for the partitions assigned to it. This happens entirely on the worker nodes. Under the hood, Spark's physical plan uses a "partial_count" aggregate, so each executor produces a single integer per partition rather than shipping individual rows.

You can verify this by looking at the physical plan. For example:

df = spark.read.table("my_catalog.my_schema.my_table")
df.select("id").groupBy().count().explain(True)

In the physical plan output, you will see a HashAggregate node with "partial_count" followed by a shuffle (Exchange) and then a final HashAggregate with "count". That shuffle step is where partial results move between nodes.
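The plan output looks roughly like this (an illustrative sketch; operator numbering and the scan details vary by Spark version and data source):

```
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
      +- ... scan of my_catalog.my_schema.my_table ...
```

Reading bottom-up: the first HashAggregate with partial_count runs on the executors, the Exchange is the shuffle of partial results, and the final HashAggregate produces the total.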

PHASE 2: FINAL AGGREGATION (DRIVER)

After the partial counts are computed, Spark performs a shuffle to collect results. For a simple df.count() (which is an action), the partial integers are sent to the driver, which sums them to produce the final result.

For grouped aggregations like df.groupBy("col").count(), the partial counts are shuffled across executors (not directly to the driver) based on the grouping key. A second set of executors then computes the final count per group. The driver only collects the final result set.

KEY DISTINCTION

- For simple df.count() (no groupBy): partial counts go to the driver for final summation. Since each executor sends just one integer, this is very lightweight.
- For df.groupBy("col").count(): Spark uses a two-stage HashAggregate with a shuffle exchange between executors. The driver collects only the final grouped result.
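The grouped case can be sketched in plain Python as well (an illustrative simulation of the two-stage HashAggregate with a shuffle; the keys and partition layout are invented for the example):

```python
# Illustrative simulation of df.groupBy("col").count():
# map-side partial counts per partition, a hash shuffle by key,
# then a reduce-side merge of the partial counts.
from collections import Counter

partitions = [
    ["red", "blue", "red"],
    ["blue", "green"],
    ["red", "green", "green"],
]

# Stage 1 (map side): partial count per key within each partition.
partial = [Counter(p) for p in partitions]

# Shuffle: route each (key, partial_count) pair to a reducer by key hash,
# so all partial counts for the same key land on the same reducer.
num_reducers = 2
shuffled = [[] for _ in range(num_reducers)]
for counter in partial:
    for key, cnt in counter.items():
        shuffled[hash(key) % num_reducers].append((key, cnt))

# Stage 2 (reduce side): merge the partial counts per key.
final = Counter()
for bucket in shuffled:
    for key, cnt in bucket:
        final[key] += cnt

print(sorted(final.items()))  # [('blue', 2), ('green', 3), ('red', 3)]
```

Only (key, count) pairs move across the shuffle, never raw rows, and the driver sees just the final grouped result.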

HOW TO OBSERVE THIS

You can see exactly what happens in the Spark UI:

1. Run your count() query in a Databricks notebook.
2. Click on the Spark Jobs link in the cell output.
3. Look at the Stages tab to see the map (partial aggregation) and reduce (final aggregation) stages.
4. The SQL/DataFrame tab shows the query plan with HashAggregate nodes.

REFERENCES

- Databricks documentation on Spark concepts and cluster architecture:

https://docs.databricks.com/en/getting-started/concepts.html

- Apache Spark documentation on monitoring and the Spark UI:

https://spark.apache.org/docs/latest/monitoring.html

- Apache Spark SQL performance tuning (covers aggregation strategies):

https://spark.apache.org/docs/latest/sql-performance-tuning.html

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.