Best option for parallel processing
04-26-2026 08:55 AM
I faced some challenges in my projects related to parallel processing in Databricks. In many cases, the issue was not the volume of data itself, but the overall execution time. I was processing a relatively small number of objects, but each object required separate notebook execution, and the orchestration became the bottleneck.
Current approach
I have a notebook that builds an array of configurations, and in the final step I trigger parallel execution of a general notebook responsible for loading data (for example, Bronze layer ingestion).
Sometimes I process 10 objects, sometimes 60.
My initial solution was based on pool.map():
from multiprocessing.pool import ThreadPool
pool = ThreadPool(32)
pool.map(run_notebook_bronze, lst)
Where:
- run_notebook_bronze() uses dbutils.notebook.run()
- lst is an array of object configurations
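For readers who want something concrete, here is a minimal sketch of what such a wrapper could look like. It is not the exact code from my project: `make_runner`, `run_all`, the `"config"` argument name and the timeout are hypothetical, and `notebook_run` is assumed to behave like `dbutils.notebook.run(path, timeout_seconds, arguments)`. The runner is passed in as a parameter so the logic can be tried outside a notebook.

```python
import json
from multiprocessing.pool import ThreadPool


def make_runner(notebook_run, notebook_path, timeout_seconds=3600):
    """Build a run_notebook_bronze-style function around a notebook runner.

    notebook_run is assumed to accept (path, timeout_seconds, arguments),
    like dbutils.notebook.run does.
    """
    def run_notebook_bronze(obj):
        try:
            # Notebook arguments are strings, so the config travels as JSON.
            result = notebook_run(
                notebook_path, timeout_seconds, {"config": json.dumps(obj)}
            )
            return {"object": obj, "status": "ok", "result": result}
        except Exception as exc:
            # Record the failure instead of aborting the whole batch.
            return {"object": obj, "status": "failed", "error": str(exc)}

    return run_notebook_bronze


def run_all(runner, lst, max_threads=32):
    # Never start more threads than there are objects to process.
    with ThreadPool(min(max_threads, max(len(lst), 1))) as pool:
        return pool.map(runner, lst)
```

In a notebook this would be called as `run_all(make_runner(dbutils.notebook.run, notebook_path), lst)`. Every thread still runs on the driver, so this only tidies up error handling and does not remove the bottleneck described below.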
This works quite well for smaller workloads, but I noticed that:
- the driver is utilized close to 100%
- executors are often only at 10% utilization
This clearly shows that the driver becomes the bottleneck, and performance drops significantly for larger workloads.
Three options I found for parallel processing
1. Python multiprocessing / ThreadPool
pool = ThreadPool(32)
pool.map(run_notebook_bronze, lst)
Pros
- very easy to implement
- fast for small workloads
- simple local testing
Cons
- driver-heavy solution
- poor scalability for larger workloads
- limited executor utilization
- dbutils.notebook.run() overhead becomes significant
2. Creating Jobs dynamically using Databricks Jobs API
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
# ID of the cluster this notebook is attached to, reused for every task
cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
# One notebook task per object configuration
tasks = [
    jobs.SubmitTask(
        existing_cluster_id=cluster_id,
        notebook_task=jobs.NotebookTask(
            notebook_path=notebook_path,
            base_parameters={...}
        ),
        task_key=f"bronze-{obj['code']}-{obj['name']}",
    )
    for obj in obj_lst
]
# Submit all tasks as a single one-time run
run = w.jobs.submit(
    run_name="bronze_parallel_run",
    tasks=tasks
)
Pros
- much better workload distribution
- tasks are managed by Databricks Jobs (according to the documentation)
Cons
- none that I have found so far
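One detail that may be worth checking with this approach is the task_key. As far as I understand the Jobs API, task keys must be unique within a run and are limited to letters, digits, dashes and underscores, with a maximum length of about 100 characters. Please treat those limits as an assumption and verify them against the current documentation. Under that assumption, a small helper (the name `safe_task_key` is hypothetical) could clean up keys built from object codes and names:

```python
import re


def safe_task_key(code, name, prefix="bronze", max_len=100):
    """Build a task key using only letters, digits, dashes and underscores.

    The character set and max_len reflect an assumed Jobs API restriction.
    """
    raw = f"{prefix}-{code}-{name}"
    # Replace every run of disallowed characters with a single underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", raw)
    # Truncate to the assumed maximum length.
    return cleaned[:max_len]
```

It would replace the f-string in the snippet above, for example `task_key=safe_task_key(obj['code'], obj['name'])`. Cleaning and truncation can make two different objects map to the same key, so uniqueness still has to be checked separately.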
3. Databricks Workflows — For Each task
dbutils.jobs.taskValues.set(
    key="params",
    value=json.dumps(param_list)
)
Then I use these parameters inside a For Each task in Databricks Workflows.
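A rough sketch of the serialization side, in case it helps the discussion. The helper names are hypothetical, and the 48 KiB limit for a single task value is my reading of the documentation, so treat it as an assumption. The idea is to fail early in the configuration notebook if the parameter list grows too large, and to parse the value that one iteration receives back into a dictionary.

```python
import json

# Assumed size limit for one task value; verify against the current docs.
MAX_TASK_VALUE_BYTES = 48 * 1024


def encode_params(param_list):
    """Serialize the parameter list and fail early if it looks too large."""
    payload = json.dumps(param_list)
    size = len(payload.encode("utf-8"))
    if size > MAX_TASK_VALUE_BYTES:
        raise ValueError(
            f"payload is {size} bytes, above the assumed "
            f"{MAX_TASK_VALUE_BYTES} byte limit"
        )
    return payload


def decode_iteration_input(raw_input):
    """Parse the JSON string that one iteration receives for one object."""
    return json.loads(raw_input)
```

If the list gets close to the limit, one option is to pass only object identifiers through the task value and let each iteration look up its full configuration itself.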
Pros
- probably the best resource and memory management
- native Databricks orchestration
- excellent observability — all runs visible in one place
- production-friendly solution
Cons
- difficult to test manually
- requires running the full Job each time
- cluster startup time can be frustrating during development
My observations
So far, multiprocessing is the fastest to implement, but it performs poorly for heavier workloads because the driver becomes the bottleneck.
For Each seems to be the strongest long-term solution, especially for production environments.
Jobs API looks promising as well, but I would like to better understand real production experiences before fully adopting it.
My question to the community
What is your experience with these approaches?
- Which option works best for production-scale parallel notebook execution?
- How do you handle testing for For Each workflows without waiting for cluster startup every time?
- Do you use other patterns for parallel processing in Databricks that provide better memory management and executor utilization?
I would love to hear your perspective, best practices, and lessons learned.
04-27-2026 10:50 AM
The Driver was the bottleneck in the Thread Pool approach. By moving to Serverless Workflows, you can shift the orchestration weight to the Databricks Control Plane.
- Eliminate Driver Saturation: Serverless compute for Workflows natively handles task distribution. Databricks provisions the necessary resources for each iteration of the objects automatically.
- For Each Task with Near-Instant Scaling: Unlike classic clusters, which can take minutes to resize, Serverless Performance Optimized mode starts in seconds and uses warm pools to scale concurrent "For Each" iterations without competing for shared Driver CPU.
- Cost Isolation: Each ingestion is billed for its specific execution time. You avoid paying for an oversized Driver or idle Executors between spikes.
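To make this a bit more concrete, here is a hedged sketch of a job definition with a single For Each task, built as a plain Python dictionary in the shape I understand the Jobs API to expect. The field names (`for_each_task`, `inputs`, `concurrency`, `{{input}}`) are as I recall them from the documentation, so please verify them. The task keys, the `"config"` parameter name and the function itself are hypothetical. No cluster is specified on the nested task, on the assumption that the workspace then runs it on serverless compute.

```python
import json


def build_for_each_job(job_name, notebook_path, param_list, concurrency=10):
    """Return a Jobs API style job definition with a single For Each task."""
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "bronze_for_each",
                "for_each_task": {
                    # JSON array; each element becomes one iteration.
                    "inputs": json.dumps(param_list),
                    # Upper bound on iterations running at the same time.
                    "concurrency": concurrency,
                    "task": {
                        "task_key": "bronze_iteration",
                        "notebook_task": {
                            "notebook_path": notebook_path,
                            # {{input}} is assumed to resolve to the current element.
                            "base_parameters": {"config": "{{input}}"},
                        },
                    },
                },
            }
        ],
    }
```

The `concurrency` value is the main tuning knob here: it plays the role that the thread count played in the ThreadPool version, but the iterations no longer share one driver.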