Re: Best option for parallel processing

AdrianLobacz · ‎04-26-2026

I faced some challenges in my projects related to parallel processing in Databricks. In many cases, the issue was not the volume of data itself, but the overall execution time. I was processing a relatively small number of objects, but each object required separate notebook execution, and the orchestration became the bottleneck.

Current approach

I have a notebook that builds an array of configurations, and in the final step I trigger parallel execution of a general notebook responsible for loading data (for example, Bronze layer ingestion).

Sometimes I process 10 objects, sometimes 60.

My initial solution was based on pool.map():

from multiprocessing.pool import ThreadPool

pool = ThreadPool(32)
pool.map(run_notebook_bronze, lst)

Where:

run_notebook_bronze() uses dbutils.notebook.run()
lst is an array of object configurations

This works quite well for smaller workloads, but I noticed that:

the driver is utilized close to 100%
executors are often only at 10% utilization

This clearly shows that the driver becomes the bottleneck, and performance drops significantly for larger workloads.

Three options I found for parallel processing

1. Python multiprocessing / ThreadPool

from multiprocessing.pool import ThreadPool

pool = ThreadPool(32)
pool.map(run_notebook_bronze, lst)

Pros

very easy to implement
fast for small workloads
simple local testing

Cons

driver-heavy solution
poor scalability for larger workloads
limited executor utilization
dbutils.notebook.run() overhead becomes significant

2. Creating Jobs dynamically using Databricks Jobs API

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")

tasks = [
jobs.SubmitTask(
existing_cluster_id=cluster_id,
notebook_task=jobs.NotebookTask(
notebook_path=notebook_path,
base_parameters={...}
),
task_key=f"bronze-{obj['code']}-{obj['name']}",
)
for obj in obj_lst
]

run = w.jobs.submit(
run_name="bronze_parallel_run",
tasks=tasks
)

Pros

much better workload distribution
tasks are managed by Databricks Jobs (information from documentation)

Cons

- don't know

3. Databricks Workflows — For Each task

Build this section in notebook:

dbutils.jobs.taskValues.set(
key="params",
value=json.dumps(param_list)
)

Then I use these parameters inside a For Each task in Databricks Workflows.

Pros

probably the best resource and memory management
native Databricks orchestration
excellent observability — all runs visible in one place
production-friendly solution

Cons

difficult to test manually
requires running the full Job each time
cluster startup time can be frustrating during development

My observations

So far:

multiprocessing is the fastest to implement
but it performs poorly for heavier workloads because the driver becomes the bottleneck

For Each seems to be the strongest long-term solution, especially for production environments.

Jobs API looks promising as well, but I would like to better understand real production experiences before fully adopting it.

My question to the community

What is your experience with these approaches?

Which option works best for production-scale parallel notebook execution?
How do you handle testing for For Each workflows without waiting for cluster startup every time?
Do you use other patterns for parallel processing in Databricks that provide better memory management and executor utilization?

I would love to hear your perspective, best practices, and lessons learned.

balajij8 · ‎04-27-2026

The Driver was the bottleneck in the Thread Pool approach. By moving to Serverless Workflows, you can shift the orchestration weight to the Databricks Control Plane.

Eliminate Driver Saturation: Serverless compute for Workflows natively handles task distribution. Databricks provisions the necessary resources for each iteration of the objects automatically.
For Each Task with Near Instant Scaling: Unlike classic clusters that take minutes to resize, Serverless Performance Optimized mode starts in seconds and uses warm pools to scale concurrent "For Each" iterations (concurrently) without fighting for shared Driver CPU.
Cost Isolation: Each ingestion is billed for its specific execution time. You avoid paying for an oversized Driver or idle Executors between spikes.