09-09-2022 05:51 AM
I have a scenario where I need to run the same Databricks notebook multiple times in parallel.
What is the best approach to do this?
09-09-2022 06:57 AM
Hi @Hariharan Sambath,
I'm using the concurrent.futures module in Python. A great explanation is in this post:
I can't say the scenario below is the best, but it works very well in my production environment.
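A minimal sketch of that idea, assuming a parameterized notebook (the notebook path and the "input_value" parameter name here are placeholders, not from the original post):

# Minimal sketch: run the same notebook in parallel with ThreadPoolExecutor.
# "/path/to/NotebookA" and the "input_value" parameter are placeholders.
from concurrent.futures import ThreadPoolExecutor

def run_one(value):
    # dbutils.notebook.run(path, timeout_seconds, arguments); timeout 0 = no timeout
    return dbutils.notebook.run("/path/to/NotebookA", 0, {"input_value": str(value)})

values = ["a", "b", "c"]  # your list of input values
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_one, values))
print(results)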
09-09-2022 07:37 AM
Hi @Leszek,
After going through the link you shared and exploring further, I found that concurrent.futures is best suited for I/O-bound operations.
Mine are CPU-bound operations involving a lot of computation, and my requirement is to run my notebook at least 3500 times with 3500 different input values.
For example, my input is a list of 3500 different values, I have a notebook called NotebookA, and I need to run that notebook once for each value in the list.
Running the notebook sequentially, one value at a time, would take a lot of time, which is why I am looking for parallel processing.
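For illustration only (the parameter name and the loader function are hypothetical), the sequential version I want to avoid looks roughly like this:

# Hypothetical sketch of the sequential approach that is too slow:
input_values = load_input_values()  # placeholder for the list of ~3500 values
for value in input_values:
    # one notebook run per value; 3500 sequential runs takes far too long
    dbutils.notebook.run("NotebookA", 0, {"input_value": str(value)})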
09-13-2022 08:05 PM
I'm using the following custom Python function I found online as a stopgap. Your notebook needs to have parameters so you can pass in different runtime values for each iteration. The taskList is just a list of parameter dictionaries, one per iteration.
# code downloaded from internet
# enables running notebooks in parallel
from concurrent.futures import ThreadPoolExecutor

class NotebookData:
    def __init__(self, path, timeout, parameters=None, retry=0):
        self.path = path
        self.timeout = timeout
        self.parameters = parameters
        self.retry = retry

def submitNotebook(notebook):
    print("Running notebook {} with params: {}".format(notebook.path, notebook.parameters))
    try:
        if notebook.parameters:
            return dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
        else:
            return dbutils.notebook.run(notebook.path, notebook.timeout)
    except Exception:
        print("FAILED: notebook {} with params: {}".format(notebook.path, notebook.parameters))
        if notebook.retry < 1:
            raise
        print("Retrying notebook {} with params: {}".format(notebook.path, notebook.parameters))
        notebook.retry = notebook.retry - 1
        return submitNotebook(notebook)

def parallelNotebooks(notebooks, numInParallel):
    # If you submit too many notebooks at once the driver may crash,
    # so this limits how many notebooks run in parallel.
    with ThreadPoolExecutor(max_workers=numInParallel) as ec:
        return [ec.submit(submitNotebook, notebook) for notebook in notebooks]

# Get the list of data to process from the config "db".
taskList = spark.sql("SELECT * FROM Config.DeltaJobsToRun").toPandas().to_dict('records')

# Create a queue of notebooks based on the metadata above, passing each record into the
# generic parameterized notebook and executing it.
# Note: dbutils.notebook.run() doesn't bubble up errors properly, so finding the cause of a
# failure currently means clicking through all the 'notebook job' links.
# TODO: investigate a logging wrapper that puts this info somewhere useful.
notebooks = []
for n in taskList:
    notebooks.append(NotebookData("./deltaLoadParam", 0, n, 1))

res = parallelNotebooks(notebooks, 2)
result = [f.result(timeout=3600) for f in res]  # This is a blocking call.
print(result)
Good luck finding the specific iteration of a run that failed with the above approach, though: dbutils.notebook.run does not return much useful information, nor does it bubble up exceptions.
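One workaround (not from the original post) is to keep each future paired with its parameters and have the child notebook return a status string via dbutils.notebook.exit, so the parent can at least see which iteration failed:

# Workaround sketch (an assumption, not the author's code): pair each future with its
# parameters so a failed iteration can be identified from the parent notebook.
from concurrent.futures import ThreadPoolExecutor

def run_with_params(params):
    # The child notebook can call dbutils.notebook.exit("OK") (or a JSON string)
    # so this returns something meaningful instead of an empty string.
    return dbutils.notebook.run("./deltaLoadParam", 0, params)

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(run_with_params, p): p for p in taskList}

for future, params in futures.items():
    try:
        print("OK", params, future.result(timeout=3600))
    except Exception as e:
        print("FAILED", params, e)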
I plan on removing this orchestration from Databricks notebooks entirely and doing it somewhere else.
To do that, create a workflow job for the notebook,
then use an external orchestration tool/script that uses the Databricks Jobs API to kick off the jobs in parallel with the correct parameters.
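A rough sketch of what that external trigger could look like, using the Jobs run-now endpoint (the workspace host, token, job ID, and "input_value" parameter name are all placeholders):

# Rough sketch: trigger an existing workflow job once per input value via the Jobs API.
# Host, token, job_id and the "input_value" parameter name are placeholders.
import requests
from concurrent.futures import ThreadPoolExecutor

HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
JOB_ID = 12345

def trigger_run(value):
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"job_id": JOB_ID, "notebook_params": {"input_value": str(value)}},
    )
    resp.raise_for_status()
    return resp.json()["run_id"]

values = ["a", "b", "c"]  # your 3500 input values
with ThreadPoolExecutor(max_workers=10) as pool:
    run_ids = list(pool.map(trigger_run, values))
print(run_ids)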
09-24-2022 01:08 AM
Hi @Hariharan Sambath,
Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!