Best option for parallel processing

AdrianLobacz
Databricks Partner

I faced some challenges in my projects related to parallel processing in Databricks. In many cases, the issue was not the volume of data itself, but the overall execution time. I was processing a relatively small number of objects, but each object required separate notebook execution, and the orchestration became the bottleneck.

Current approach

I have a notebook that builds an array of configurations, and in the final step I trigger parallel execution of a general notebook responsible for loading data (for example, Bronze layer ingestion).

Sometimes I process 10 objects, sometimes 60.

My initial solution was based on pool.map():

 
from multiprocessing.pool import ThreadPool

pool = ThreadPool(32)
pool.map(run_notebook_bronze, lst)
 

Where:

  • run_notebook_bronze() uses dbutils.notebook.run()
  • lst is an array of object configurations

This works quite well for smaller workloads, but I noticed that:

  • the driver is utilized close to 100%
  • executors are often only at 10% utilization

This clearly shows that the driver becomes the bottleneck, and performance drops significantly for larger workloads.


Three options I found for parallel processing

1. Python multiprocessing / ThreadPool

 

 
from multiprocessing.pool import ThreadPool

pool = ThreadPool(32)
pool.map(run_notebook_bronze, lst)

Pros

  • very easy to implement
  • fast for small workloads
  • simple local testing

Cons

  • driver-heavy solution
  • poor scalability for larger workloads
  • limited executor utilization
  • dbutils.notebook.run() overhead becomes significant

2. Creating Jobs dynamically using Databricks Jobs API

 

 
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")

tasks = [
jobs.SubmitTask(
existing_cluster_id=cluster_id,
notebook_task=jobs.NotebookTask(
notebook_path=notebook_path,
base_parameters={...}
),
task_key=f"bronze-{obj['code']}-{obj['name']}",
)
for obj in obj_lst
]

run = w.jobs.submit(
run_name="bronze_parallel_run",
tasks=tasks
)

Pros

  • much better workload distribution
  • tasks are managed by Databricks Jobs (information from documentation)

Cons

- don't know


3. Databricks Workflows — For Each task

 

Build this section in notebook:
 
dbutils.jobs.taskValues.set(
key="params",
value=json.dumps(param_list)
)

Then I use these parameters inside a For Each task in Databricks Workflows.

Pros

  • probably the best resource and memory management
  • native Databricks orchestration
  • excellent observability — all runs visible in one place
  • production-friendly solution

Cons

  • difficult to test manually
  • requires running the full Job each time
  • cluster startup time can be frustrating during development

My observations

So far:

  • multiprocessing is the fastest to implement
  • but it performs poorly for heavier workloads because the driver becomes the bottleneck

For Each seems to be the strongest long-term solution, especially for production environments.

Jobs API looks promising as well, but I would like to better understand real production experiences before fully adopting it.


My question to the community

What is your experience with these approaches?

  • Which option works best for production-scale parallel notebook execution?
  • How do you handle testing for For Each workflows without waiting for cluster startup every time?
  • Do you use other patterns for parallel processing in Databricks that provide better memory management and executor utilization?

I would love to hear your perspective, best practices, and lessons learned.