I have faced some challenges with parallel processing in my Databricks projects. In many cases the issue was not the volume of data itself but the overall execution time: I was processing a relatively small number of objects, but each object required a separate notebook execution, and the orchestration became the bottleneck.
Current approach
I have a notebook that builds an array of configurations, and in the final step I trigger parallel execution of a general notebook responsible for loading data (for example, Bronze layer ingestion).
Sometimes I process 10 objects, sometimes 60.
My initial solution was based on pool.map():
from multiprocessing.pool import ThreadPool
pool = ThreadPool(32)
pool.map(run_notebook_bronze, lst)
Where:
- run_notebook_bronze() uses dbutils.notebook.run()
- lst is an array of object configurations
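For reference, here is a minimal, self-contained sketch of this pattern. The body of run_notebook_bronze and the config shape are hypothetical stand-ins so the sketch runs outside Databricks; in the real notebook the function would call dbutils.notebook.run() as described above:

```python
from multiprocessing.pool import ThreadPool

def run_notebook_bronze(config):
    # In Databricks this would be something like:
    #   dbutils.notebook.run("/Shared/bronze_load",
    #                        timeout_seconds=3600,
    #                        arguments=config)
    # A stand-in return value keeps the sketch runnable locally.
    return f"loaded {config['name']}"

# Hypothetical object configurations.
lst = [{"name": f"object_{i}"} for i in range(5)]

with ThreadPool(32) as pool:
    # map() blocks until all notebook runs finish and preserves input order.
    results = pool.map(run_notebook_bronze, lst)

print(results)
```

Note that every worker thread still lives on the driver, which is exactly why driver CPU saturates while executors sit idle.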
This works quite well for smaller workloads, but I noticed that:
- the driver is utilized close to 100%
- executors are often only at 10% utilization
This clearly shows that the driver becomes the bottleneck, and performance drops significantly for larger workloads.
Three options I found for parallel processing
1. Python multiprocessing / ThreadPool
from multiprocessing.pool import ThreadPool
pool = ThreadPool(32)
pool.map(run_notebook_bronze, lst)
Pros
- very easy to implement
- fast for small workloads
- simple local testing
Cons
- driver-heavy solution
- poor scalability for larger workloads
- limited executor utilization
- dbutils.notebook.run() overhead becomes significant
2. Creating Jobs dynamically using Databricks Jobs API
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs
w = WorkspaceClient()
cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
tasks = [
    jobs.SubmitTask(
        existing_cluster_id=cluster_id,
        notebook_task=jobs.NotebookTask(
            notebook_path=notebook_path,
            base_parameters={...}
        ),
        task_key=f"bronze-{obj['code']}-{obj['name']}",
    )
    for obj in obj_lst
]
run = w.jobs.submit(
    run_name="bronze_parallel_run",
    tasks=tasks
)
Pros
- much better workload distribution
- tasks are scheduled and managed by the Databricks Jobs service (according to the documentation)
Cons
- I don't have enough hands-on experience with this approach yet to list real drawbacks
3. Databricks Workflows — For Each task
In the driver notebook, I publish the parameter list as a task value:
import json

dbutils.jobs.taskValues.set(
    key="params",
    value=json.dumps(param_list)
)
Then I use these parameters inside a For Each task in Databricks Workflows.
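To make the hand-off concrete, here is a small sketch of the serialization side and of what each For Each iteration would receive. The param_list contents are a hypothetical example; the point is that json.dumps/json.loads round-trip the list losslessly, so each iteration sees one element as its input:

```python
import json

# Hypothetical parameter list built by the driver notebook.
param_list = [
    {"code": 1, "name": "customers"},
    {"code": 2, "name": "orders"},
]

# In Databricks the serialized value is published with:
#   dbutils.jobs.taskValues.set(key="params", value=json.dumps(param_list))
serialized = json.dumps(param_list)

# The For Each task iterates over the deserialized array; each iteration
# receives one element as its task parameter.
for item in json.loads(serialized):
    print(item["name"])
```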
Pros
- probably the best resource and memory management
- native Databricks orchestration
- excellent observability — all runs visible in one place
- production-friendly solution
Cons
- difficult to test manually
- requires running the full Job each time
- cluster startup time can be frustrating during development
My observations
So far:
- multiprocessing is the fastest to implement
- but it performs poorly for heavier workloads because the driver becomes the bottleneck
For Each seems to be the strongest long-term solution, especially for production environments.
Jobs API looks promising as well, but I would like to better understand real production experiences before fully adopting it.
My question to the community
What is your experience with these approaches?
- Which option works best for production-scale parallel notebook execution?
- How do you handle testing for For Each workflows without waiting for cluster startup every time?
- Do you use other patterns for parallel processing in Databricks that provide better memory management and executor utilization?
I would love to hear your perspective, best practices, and lessons learned.