
How to Stop Driver Node from Overloading When Using ThreadPoolExecutor in Databricks

alex307
New Contributor II

Hi everyone,

I'm using a ThreadPoolExecutor in Databricks to run multiple notebooks at the same time. The problem is that it seems like all the processing happens on the driver node, while the executor nodes are idle. This causes the driver to run out of memory and eventually crash.

Each notebook performs the same steps: creating large pandas and Spark DataFrames, then appending them to Delta tables. I'm not sure what I'm missing here. How can I ensure that the workload is properly distributed across the executor nodes rather than putting all the pressure on the driver?
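For context, the pattern I'm using looks roughly like this (simplified; the notebook paths, worker count, and timeout are placeholders, and it runs in a Databricks notebook where dbutils is available):

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder notebook paths -- the real workload runs many more of these
notebook_paths = [
    "/Workspace/jobs/load_table_a",
    "/Workspace/jobs/load_table_b",
]

def run_notebook(path):
    # Each child notebook builds large pandas and Spark DataFrames
    # and appends the results to Delta tables
    return dbutils.notebook.run(path, 3600)

# All of these threads live on the driver
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_notebook, notebook_paths))
```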

I'm looking forward to your reply. Thanks in advance!

1 REPLY

mmayorga
Databricks Employee

Greetings @alex307, and thank you for your question.

When you use ThreadPoolExecutor to run multiple notebooks concurrently in Databricks, the workload is executed on the driver node rather than distributed across the Spark executors. This results in high driver memory consumption and can lead to out-of-memory errors or crashes, especially when you manipulate large pandas and Spark DataFrames or write to Delta tables.

Why This Happens

ThreadPoolExecutor is a Python-level parallelization tool, so its operations take place on the driver node only. The Spark executors remain idle unless work is submitted to them via Spark actions. Additionally, conventional pandas operations are not distributed; they are limited to the driver's memory and compute resources. This restricts cluster scalability and leads to instability if the driver's resources are exceeded.
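To illustrate, here is a hypothetical worker function of the kind described above (the table and column names are only placeholders, and spark is the SparkSession that Databricks notebooks provide). Everything after the read, including the .toPandas() call and the plain pandas transformation, materializes and runs on the driver, and running several of these threads at once multiplies that memory pressure:

```python
def process_table(table_name: str) -> None:
    # The read itself is distributed, but .toPandas() pulls the entire
    # result set into driver memory as a single pandas DataFrame
    pdf = spark.table(table_name).toPandas()

    # Plain pandas operations are single-node and run on the driver only
    pdf["amount_usd"] = pdf["amount"] * pdf["fx_rate"]

    # Re-creating a Spark DataFrame serializes all of that data back
    # through the driver before the distributed write can start
    spark.createDataFrame(pdf).write.mode("append").saveAsTable(f"{table_name}_enriched")
```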

Recommended Workarounds and Considerations

  • Migrate pandas workloads to the pandas API on Spark: Use the pandas API on Spark (import pyspark.pandas as ps) to run familiar pandas-style code at scale. This distributes the workload across the cluster's executors, dramatically improving resource utilization and scalability (a minimal sketch follows this list).

  • Favor Spark DataFrame operations: Where possible, process large datasets with Spark DataFrames and Spark SQL. These are designed for distributed computation and keep memory use balanced across all cluster nodes.

  • Limit driver-side processing: Avoid collecting large datasets into driver memory (for example, don't call .collect() or .toPandas() on massive DataFrames). Use .limit() to preview small subsets of data when necessary.

  • Increase driver resources if necessary: For workflows that genuinely require intensive driver-side processing, consider resizing your driver node, but note that this is less scalable and may incur higher costs.

  • Optimize cluster configuration: Monitor driver memory in the Spark UI and adjust cluster settings accordingly. Periodically restart clusters to clear stale objects and memory, and run heavy jobs on dedicated clusters.

  • Review official performance recommendations: Follow Databricks best practices to maximize cluster stability and efficiency.
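To make the first three points concrete, here is a minimal sketch, assuming a Databricks notebook where spark is already defined; the table and column names are purely illustrative. The same kind of transformation is expressed with the pandas API on Spark and written to Delta without ever collecting the data to the driver:

```python
import pyspark.pandas as ps

# pandas API on Spark: familiar pandas syntax, distributed execution on the executors
psdf = ps.read_table("source_table")                  # illustrative table name
psdf["amount_usd"] = psdf["amount"] * psdf["fx_rate"]

# Hand off to a Spark DataFrame and write to Delta -- nothing is collected to the driver
(psdf.to_spark()
     .write
     .format("delta")
     .mode("append")
     .saveAsTable("target_table"))

# If you only need to inspect results, bring back a small sample instead of the full table
preview = spark.table("target_table").limit(10).toPandas()
```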


I hope this helps. If it does, please accept it as the solution. Thank you!
