Greetings @alex307, and thank you for your question!
When you use ThreadPoolExecutor to run multiple notebooks concurrently in Databricks, the work executes on the driver node rather than being distributed across the Spark executors. This drives up driver memory consumption and can lead to out-of-memory errors or crashes, especially when you manipulate large pandas and Spark DataFrames or write to Delta tables.
Why This Happens
ThreadPoolExecutor is a Python-level parallelization tool, so everything it runs executes on the driver node; the Spark executors stay idle unless the threads explicitly trigger Spark jobs. Plain pandas operations are not distributed either: they are confined to the driver's memory and CPU. Together, this limits cluster scalability and causes instability once the driver's resources are exhausted.
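To make the failure mode concrete, here is a minimal sketch of the pattern that typically causes it. The table and column names are hypothetical, and spark is the SparkSession that Databricks provides in every notebook:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-table task. Everything in it that is plain Python or
# pandas executes on the driver; the executors only participate while
# Spark itself is reading the table.
def process_table(table_name):
    sdf = spark.read.table(table_name)          # distributed read (lazy)
    pdf = sdf.toPandas()                        # pulls the ENTIRE table into driver RAM
    pdf["total"] = pdf["qty"] * pdf["price"]    # pandas math: driver CPU and memory only
    return len(pdf)

# Four tables processed concurrently means roughly four tables' worth of
# data held in driver memory at once, which is the usual cause of driver OOMs.
with ThreadPoolExecutor(max_workers=4) as pool:
    row_counts = list(pool.map(process_table, ["t1", "t2", "t3", "t4"]))
```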
Recommended Workarounds and Considerations
- Migrate pandas workloads to the pandas API on Spark: use the pandas API on Spark (import pyspark.pandas as ps) to run familiar pandas-style code at scale. This distributes the workload across the cluster's executors, dramatically improving resource utilization and scalability (first sketch after this list).
- Favor Spark DataFrame operations: where possible, process large datasets with Spark DataFrames and Spark SQL. They are designed for distributed computation and keep memory use balanced across all cluster nodes (second sketch below).
- Limit driver-side processing: avoid collecting large datasets into driver memory; for example, don't call .collect() or .toPandas() on massive DataFrames. Use .limit() to preview small subsets when you need to inspect data (third sketch below).
- Increase driver resources if necessary: for workflows that genuinely require intensive driver-side processing, consider a larger driver node type, but note that this scales poorly and may incur higher costs.
- Optimize cluster configuration: monitor driver memory in the Spark UI and adjust cluster settings accordingly. Restart long-running clusters periodically to clear stale objects and memory, and run heavy jobs on dedicated clusters.
- Review official performance recommendations: follow Databricks best practices to maximize cluster stability and efficiency.
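First sketch: the same kind of transformation rewritten with the pandas API on Spark. The table and column names are hypothetical; the calls themselves (ps.read_table, groupby, to_spark) are standard pyspark.pandas APIs:

```python
import pyspark.pandas as ps

# Read a table into a pandas-on-Spark DataFrame; the data stays distributed.
psdf = ps.read_table("my_schema.sales")             # hypothetical table name

psdf["total"] = psdf["qty"] * psdf["price"]         # column math runs on executors
summary = psdf.groupby("region")["total"].sum()     # distributed aggregation

# Convert to a regular Spark DataFrame and write to Delta without ever
# collecting the data onto the driver.
summary.to_frame().reset_index().to_spark() \
    .write.mode("overwrite").saveAsTable("my_schema.sales_summary")
```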
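Second sketch: the equivalent logic with native Spark DataFrames, again with hypothetical table and column names:

```python
from pyspark.sql import functions as F

# Every step below is planned and executed across the cluster; the driver
# only coordinates, so its memory footprint stays small.
sdf = spark.read.table("my_schema.sales")           # hypothetical table name

region_totals = (
    sdf.withColumn("total", F.col("qty") * F.col("price"))
       .groupBy("region")
       .agg(F.sum("total").alias("region_total"))
)

# Write directly from the executors to a Delta table.
region_totals.write.mode("overwrite").saveAsTable("my_schema.region_totals")
```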
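Third sketch: bounding what does come back to the driver when you genuinely need a local pandas object (table name hypothetical):

```python
# Risky on a large table: .toPandas() materializes every row in driver RAM.
# pdf = spark.read.table("my_schema.big_table").toPandas()

# Safer: cap the number of rows before converting.
preview = spark.read.table("my_schema.big_table").limit(1000).toPandas()

# For a quick look, .show() fetches only the requested rows to the driver.
spark.read.table("my_schema.big_table").show(20)
```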
I hope this helps, and if it does, please accept it as the solution. Thank you!