Data Engineering

Parallel Processing of Databricks Notebook

HariharaSam
Contributor

I have a scenario where I need to run the same Databricks notebook multiple times in parallel.

What is the best approach to do this?

4 REPLIES

Leszek
Contributor

Hi @Hariharan Sambath,

I'm using the concurrent.futures module in Python. A great explanation is in this post:

https://transform365.blog/2020/06/21/run-same-databricks-notebook-for-multiple-times-in-parallel-con...

I cannot say that the scenario below is the best, but it works very well in my production environment 🙂
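
For reference, a minimal sketch of that pattern (the notebook path and the input_value parameter below are hypothetical - adjust them to your workspace):

from concurrent.futures import ThreadPoolExecutor

notebook_path = "/Shared/NotebookA"     # hypothetical parameterised notebook
input_values = ["a", "b", "c"]          # one entry per run

def run_once(value):
    # Each call starts a separate notebook run; 0 means no timeout.
    return dbutils.notebook.run(notebook_path, 0, {"input_value": value})

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_once, input_values))

print(results)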

HariharaSam
Contributor

Hi @Leszek,

After going through the link you shared and exploring further, I found that it is best suited for I/O-bound operations.

But mine involves CPU-bound operations with a lot of computation, and on top of that I need to run my notebook at least 3500 times with 3500 different input values; that is my requirement.

For example, consider that my input is a list of 3500 different values, I have a notebook called NotebookA, and I need to run the notebook once for each value in the list.

Running the notebook sequentially, one value at a time, would take a lot of time, and that's why I am looking for parallel processing.

jakubk
Contributor

I'm using the following custom Python code I found online as a stopgap. Your notebook needs to have parameters so you can pass in different runtime values for each iteration. The task list is just a list of parameter dictionaries, one per iteration.
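
For context, the notebook being called reads those runtime values as widgets; a minimal sketch of that side (the input_value parameter name is just an example) looks like this:

# Inside the called notebook (sketch): read the value passed by dbutils.notebook.run().
dbutils.widgets.text("input_value", "")            # declare the widget with an empty default
input_value = dbutils.widgets.get("input_value")   # value supplied for this particular run

# ... processing for this single input value goes here ...

dbutils.notebook.exit("done: " + input_value)      # returned to the caller as the run result

The runner code itself is below.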

# code downloaded from the internet
# enables running notebooks in parallel
from concurrent.futures import ThreadPoolExecutor

class NotebookData:
    def __init__(self, path, timeout, parameters=None, retry=0):
        self.path = path              # workspace path of the notebook to run
        self.timeout = timeout        # timeout in seconds (0 = no timeout)
        self.parameters = parameters  # dict of widget parameters for this run
        self.retry = retry            # how many times to retry on failure

def submitNotebook(notebook):
    print("Running notebook {} with params: {}".format(notebook.path, notebook.parameters))
    try:
        if notebook.parameters:
            return dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
        else:
            return dbutils.notebook.run(notebook.path, notebook.timeout)
    except Exception:
        print("FAILED: notebook {} with params: {}".format(notebook.path, notebook.parameters))
        if notebook.retry < 1:
            raise
        print("Retrying notebook {} with params: {}".format(notebook.path, notebook.parameters))
        notebook.retry = notebook.retry - 1
        return submitNotebook(notebook)  # return the retried result instead of dropping it

def parallelNotebooks(notebooks, numInParallel):
    # If you submit too many notebooks at once the driver may crash,
    # so this limits how many notebooks run in parallel.
    with ThreadPoolExecutor(max_workers=numInParallel) as ec:
        return [ec.submit(submitNotebook, notebook) for notebook in notebooks]


# Get the list of parameter sets to process from the config "db".
taskList = spark.sql("select * FROM Config.DeltaJobsToRun").toPandas().to_dict('records')

# Create a queue of notebook runs from the metadata above and pass each parameter set
# into the generic parameterised notebook.
# Note: dbutils.notebook.run() doesn't bubble up errors properly, so finding the cause of a
# failure currently means clicking through all the 'notebook job' links.
# TODO: investigate a logging wrapper that puts this info somewhere more useful.
notebooks = []
for n in taskList:
    notebooks.append(NotebookData("./deltaLoadParam", 0, n, 1))

res = parallelNotebooks(notebooks, 2)
result = [f.result(timeout=3600) for f in res]  # blocking call; waits for each run to finish
print(result)

Good luck finding the specific iteration of a run that failed with the above approach, though - the dbutils.notebook.run function does not return much useful information, nor does it bubble up exceptions.
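
One workaround (just a sketch that reuses the NotebookData and submitNotebook definitions above, not something I've hardened) is to wrap each submission so every future carries its parameters plus either the result or the error, which at least tells you which iteration failed:

from concurrent.futures import ThreadPoolExecutor

def submitNotebookSafe(notebook):
    # Capture the outcome per iteration instead of losing the exception in the thread pool.
    try:
        return {"params": notebook.parameters, "status": "OK", "result": submitNotebook(notebook)}
    except Exception as e:
        return {"params": notebook.parameters, "status": "FAILED", "error": str(e)}

with ThreadPoolExecutor(max_workers=2) as ec:
    futures = [ec.submit(submitNotebookSafe, nb) for nb in notebooks]
outcomes = [f.result() for f in futures]

for o in outcomes:
    if o["status"] == "FAILED":
        print("Failed params:", o["params"], "error:", o["error"])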

I plan on removing this functionality from Databricks entirely and orchestrating it somewhere else.

To do that, create a workflow job for that notebook,

then use an external orchestration tool or script that calls the Databricks Jobs API to kick off the runs in parallel with the correct parameters.
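
A rough sketch of that (assuming the notebook is wrapped in a job whose job_id you already know, the job's max_concurrent_runs allows parallel runs, and you authenticate with a personal access token - all placeholders below are hypothetical):

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # hypothetical workspace URL
TOKEN = "<personal-access-token>"                         # hypothetical token
JOB_ID = 123                                              # hypothetical job wrapping the notebook

def trigger_run(params):
    # Jobs API 2.1 run-now: starts one run of the job with the given notebook parameters.
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"job_id": JOB_ID, "notebook_params": params},
    )
    resp.raise_for_status()
    return resp.json()["run_id"]

run_ids = [trigger_run({"input_value": str(v)}) for v in ["a", "b", "c"]]
print(run_ids)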

Anonymous
Not applicable

Hi @Hariharan Sambath,

Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
