Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Best option for parallel processing

AdrianLobacz
Databricks Partner

I faced some challenges in my projects related to parallel processing in Databricks. In many cases, the issue was not the volume of data itself, but the overall execution time. I was processing a relatively small number of objects, but each object required separate notebook execution, and the orchestration became the bottleneck.

Current approach

I have a notebook that builds an array of configurations, and in the final step I trigger parallel execution of a general notebook responsible for loading data (for example, Bronze layer ingestion).

Sometimes I process 10 objects, sometimes 60.

My initial solution was based on pool.map():

from multiprocessing.pool import ThreadPool

pool = ThreadPool(32)
pool.map(run_notebook_bronze, lst)

Where:

  • run_notebook_bronze() uses dbutils.notebook.run()
  • lst is an array of object configurations
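Since run_notebook_bronze() itself is not shown, here is a minimal sketch of what such a wrapper might look like; the notebook path and the "config" parameter name are placeholders, not the real project's values, and the runner is injectable so the wrapper can be exercised outside Databricks:

```python
import json

def run_notebook_bronze(obj, run=None, timeout_seconds=3600):
    """Run the generic Bronze ingestion notebook for one object config.

    `run` defaults to dbutils.notebook.run on a Databricks cluster; it is
    injectable so the wrapper can be tested locally with a fake runner.
    """
    if run is None:
        run = dbutils.notebook.run  # available only inside Databricks
    # notebook parameters must be strings, so the config dict is JSON-encoded
    return run("/Shared/bronze_ingest", timeout_seconds, {"config": json.dumps(obj)})
```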

This works quite well for smaller workloads, but I noticed that:

  • the driver is utilized close to 100%
  • executors are often only at 10% utilization

This clearly shows that the driver becomes the bottleneck, and performance drops significantly for larger workloads.


Three options I found for parallel processing

1. Python multiprocessing / ThreadPool

from multiprocessing.pool import ThreadPool

pool = ThreadPool(32)
pool.map(run_notebook_bronze, lst)

Pros

  • very easy to implement
  • fast for small workloads
  • simple local testing

Cons

  • driver-heavy solution
  • poor scalability for larger workloads
  • limited executor utilization
  • dbutils.notebook.run() overhead becomes significant
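One way to soften the first two cons without leaving plain Python is to cap the concurrency and record per-object failures, so a single bad object does not abort the whole batch the way a raw pool.map() call can. This is only a sketch; run_all and the injected worker are my names, not part of the original setup:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(objects, worker, max_workers=8):
    """Run `worker` for each config dict; return {name: ("ok"|"error", detail)}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        # submit everything up front, but only max_workers run at once
        futures = {ex.submit(worker, obj): obj["name"] for obj in objects}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                results[name] = ("ok", fut.result())
            except Exception as err:  # one failed notebook must not kill the rest
                results[name] = ("error", str(err))
    return results
```

The failed entries can then be retried or reported at the end instead of crashing the orchestration notebook halfway through.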

2. Creating Jobs dynamically using Databricks Jobs API

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")

tasks = [
    jobs.SubmitTask(
        existing_cluster_id=cluster_id,
        notebook_task=jobs.NotebookTask(
            notebook_path=notebook_path,
            base_parameters={...}
        ),
        task_key=f"bronze-{obj['code']}-{obj['name']}",
    )
    for obj in obj_lst
]

run = w.jobs.submit(
    run_name="bronze_parallel_run",
    tasks=tasks
)

Pros

  • much better workload distribution
  • tasks are scheduled and managed by the Databricks Jobs service (per the documentation)

Cons

  • none identified yet; that is partly why I am asking here
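One thing I would want to verify before adopting this option is how to tell which objects failed once the submitted run finishes. The run object returned by the Jobs API carries per-task states, so a small helper over those fields could separate successes from failures; the field names below follow the Jobs API 2.1 run object and should be treated as an assumption to check:

```python
def summarize_run(task_runs):
    """Split task runs into (succeeded, failed) lists of task_key values.

    Each element is expected to look like a `tasks` entry of a Jobs API run
    object, i.e. {"task_key": ..., "state": {"result_state": ...}}.
    """
    succeeded, failed = [], []
    for t in task_runs:
        state = t.get("state", {}).get("result_state")
        (succeeded if state == "SUCCESS" else failed).append(t.get("task_key"))
    return succeeded, failed
```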


3. Databricks Workflows — For Each task

I build this section in a notebook:

import json

dbutils.jobs.taskValues.set(
    key="params",
    value=json.dumps(param_list)
)

Then I use these parameters inside a For Each task in Databricks Workflows.
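Since taskValues only passes small JSON strings between tasks, a quick local sanity check that the config list round-trips through json cleanly can save a debugging loop; param_list here is a sample standing in for the real array built earlier in the notebook:

```python
import json

# sample configs standing in for the real array of object configurations
param_list = [
    {"code": "c1", "name": "orders"},
    {"code": "c2", "name": "customers"},
]

payload = json.dumps(param_list)          # what taskValues.set() stores
assert json.loads(payload) == param_list  # what the For Each task reads back
```

In the For Each task's input, the stored value is then referenced with a dynamic value reference along the lines of {{tasks.<task_key>.values.params}}; verify the exact syntax against the docs for your workspace.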

Pros

  • probably the best resource and memory management
  • native Databricks orchestration
  • excellent observability — all runs visible in one place
  • production-friendly solution

Cons

  • difficult to test manually
  • requires running the full Job each time
  • cluster startup time can be frustrating during development

My observations

So far:

  • multiprocessing is the fastest to implement
  • but it performs poorly for heavier workloads because the driver becomes the bottleneck

For Each seems to be the strongest long-term solution, especially for production environments.

Jobs API looks promising as well, but I would like to better understand real production experiences before fully adopting it.


My question to the community

What is your experience with these approaches?

  • Which option works best for production-scale parallel notebook execution?
  • How do you handle testing for For Each workflows without waiting for cluster startup every time?
  • Do you use other patterns for parallel processing in Databricks that provide better memory management and executor utilization?

I would love to hear your perspective, best practices, and lessons learned.
