04-22-2024 05:16 AM
Hello All,
My scenario requires me to write code that reads tables from a source catalog and writes them to a destination catalog using Spark. Copying them one by one is not a good option when there are 300 tables in the catalog, so I tried a ProcessPoolExecutor, which creates a separate process for each table and runs as many of them concurrently as possible. When I ran the code, it threw an error like "PythonUtils does not exist in the JVM". Essentially, I made two notebooks: one parent and one child. I'm sharing my code below; could you advise me on how to solve this error and implement parallelism for my task?
#parent notebook code
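For reference, the copy step inside my child notebook is essentially just a Spark read from the source catalog followed by a write to the destination catalog, roughly like this (the catalog, schema, and table names here are placeholders, not my real ones):

# Per-table copy step (illustrative sketch, placeholder names)
source_table = "source_catalog.source_schema.my_table"
target_table = "dest_catalog.dest_schema.my_table"

df = spark.read.table(source_table)
df.write.mode("overwrite").saveAsTable(target_table)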
04-23-2024 01:19 AM
Hi @ETLdeveloper
You can use multithreading, which lets you run the notebooks in parallel. A ProcessPoolExecutor does not work here because the worker processes cannot reuse the driver's SparkContext/Py4J gateway (that is what the "PythonUtils does not exist in the JVM" error points to), whereas threads in the driver process share it.
Attaching code for your reference -
from concurrent.futures import ThreadPoolExecutor

class NotebookData:
    def __init__(self, path, timeout, parameters=None, retry=0):
        self.path = path
        self.timeout = timeout
        self.parameters = parameters
        self.retry = retry

    @staticmethod
    def submit_notebook(notebook):
        # print("Running URL for Table : %s " % (notebook.parameters['url']))
        try:
            if notebook.parameters:
                return dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
            else:
                return dbutils.notebook.run(notebook.path, notebook.timeout)
        except Exception:
            if notebook.retry < 1:
                print("Failed For : ", notebook.parameters)
                raise
            # print("Retrying for : %s " % (notebook.parameters['url']))
            notebook.retry = notebook.retry - 1
            return NotebookData.submit_notebook(notebook)

def parallel_notebooks(notebooks, parallel_thread):
    """
    If you submit too many notebooks at once the driver may crash,
    so max_workers limits how many notebooks run in parallel.
    """
    with ThreadPoolExecutor(max_workers=parallel_thread) as ec:
        return [ec.submit(NotebookData.submit_notebook, notebook) for notebook in notebooks]

# Array of NotebookData instances, one notebook run per entry in var_dict
notebooks = [NotebookData("/Workspace/live/raw/test/abc", 3600, {"url": str(url)}) for url in var_dict.values()]
parallel_thread = 60

try:
    res = parallel_notebooks(notebooks, parallel_thread)
    result = [i.result(timeout=3600) for i in res]  # This is a blocking call.
    print(result)
except Exception as e:
    print(e)
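On the child notebook side, each parameter you pass through dbutils.notebook.run arrives as a notebook widget, and whatever you pass to dbutils.notebook.exit becomes the value returned to the parent's future. A minimal sketch of the child notebook, assuming the same "url" parameter key as above and placeholder catalog/table names:

# Child notebook (sketch): read the parameter passed by the parent
url = dbutils.widgets.get("url")  # same key as in the parameters dict above

# Per-table work goes here - placeholder copy logic, adjust names to your setup
df = spark.read.table(f"source_catalog.source_schema.{url}")
df.write.mode("overwrite").saveAsTable(f"dest_catalog.dest_schema.{url}")

# Return a status string to the parent; this is what i.result() receives
dbutils.notebook.exit(f"done: {url}")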