Getting error Caused by: com.databricks.NotebookExecutionException: FAILED

DipakBachhav
New Contributor III

I am trying to run the notebook below through Databricks but keep getting the error shown at the end. I have already tried increasing the notebook timeout and adding a retry mechanism, but no luck yet:

   notebooks = [
       NotebookData("/Users/mynotebook", 9900, retry=3)
   ]

   res = parallelNotebooks(notebooks, 2)
   result = [f.result(timeout=9900) for f in res]  # This is a blocking call.
   print(result)

Can someone please help me sort out this issue? Thanks. Here is the full code:

  %python

   from concurrent.futures import ThreadPoolExecutor

   class NotebookData:
       def __init__(self, path, timeout, parameters=None, retry=0):
           self.path = path
           self.timeout = timeout
           self.parameters = parameters
           self.retry = retry

   def submitNotebook(notebook):
       print("Running notebook %s" % notebook.path)
       try:
           if notebook.parameters:
               return dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
           else:
               return dbutils.notebook.run(notebook.path, notebook.timeout)
       except Exception:
           if notebook.retry < 1:
               raise
           print("Retrying notebook %s" % notebook.path)
           notebook.retry = notebook.retry - 1
           return submitNotebook(notebook)  # return the retried result so f.result() receives it

   def parallelNotebooks(notebooks, numInParallel):
       # Limit how many notebooks run concurrently. Note the executor
       # waits for all submitted runs to finish before returning.
       with ThreadPoolExecutor(max_workers=numInParallel) as ec:
           return [ec.submit(submitNotebook, notebook) for notebook in notebooks]

   notebooks = [
       NotebookData("/Users/mynotebook", 1200000, retry=0)
   ]

   res = parallelNotebooks(notebooks, 2)
   result = [f.result(timeout=1200000) for f in res]  # This is a blocking call.
   print(result)

Error:

  Py4JJavaError: An error occurred while calling o1741._run.
  : com.databricks.WorkflowException: com.databricks.NotebookExecutionException: FAILED
   at com.databricks.workflow.WorkflowDriver.run(WorkflowDriver.scala:95)
   at com.databricks.dbutils_v1.impl.NotebookUtilsImpl.run(NotebookUtilsImpl.scala:122)
   at com.databricks.dbutils_v1.impl.NotebookUtilsImpl._run(NotebookUtilsImpl.scala:89)
   at sun.reflect.GeneratedMethodAccessor820.invoke(Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
   at py4j.Gateway.invoke(Gateway.java:295)
   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   at py4j.commands.CallCommand.execute(CallCommand.java:79)
   at py4j.GatewayConnection.run(GatewayConnection.java:251)
   at java.lang.Thread.run(Thread.java:748)
  Caused by: com.databricks.NotebookExecutionException: FAILED
   at com.databricks.workflow.WorkflowDriver.run0(WorkflowDriver.scala:141)
   at com.databricks.workflow.WorkflowDriver.run(WorkflowDriver.scala:90)
   ... 12 more

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

Not sure what this code does, but Spark executes job by job, so a ThreadPoolExecutor doesn't make much sense on its own. If you want to execute notebooks in parallel, run them as separate jobs with the fair scheduler, so resources are reserved for each notebook: in the first line of each notebook, call sc.setLocalProperty("spark.scheduler.pool", "somename"), where "somename" is unique to that parallel notebook execution.
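
As a minimal sketch of that suggestion (assuming the cluster's spark.scheduler.mode is set to FAIR, which fair-scheduler pools require; the pool name is a placeholder):

   # First cell of each notebook that should run in parallel.
   # sc is the SparkContext the Databricks notebook provides.
   # "mynotebook-pool" is a placeholder; use a unique name per notebook.
   sc.setLocalProperty("spark.scheduler.pool", "mynotebook-pool")

   # All Spark jobs triggered from this notebook now run in that pool,
   # so concurrent notebooks get a fair share of cluster resources.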


4 REPLIES


Just a quibble with the accepted answer here. It makes great sense to run notebooks in parallel, and it can greatly improve performance. On a relatively small cluster (4 nodes, 16 vCPUs total), I was able to cut my job's runtime in half.

Spawning multiple runs lets you utilize the cluster better. With sequential notebook runs, executors often sit idle while the last few tasks of a Spark job complete. Running multiple Python threads on the driver node lets Spark begin work on other jobs' tasks during that downtime.
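
To make that concrete, here is a minimal sketch combining the two ideas: threads on the driver kick off the notebooks, and each child notebook claims its own fair-scheduler pool. The notebook paths, pool names, and timeout are all hypothetical placeholders.

   from concurrent.futures import ThreadPoolExecutor

   # Hypothetical child notebooks, relative to this notebook's folder.
   paths = ["./etl_step_1", "./etl_step_2"]

   def run_notebook(path):
       # 3600 is an example timeout in seconds.
       return dbutils.notebook.run(path, 3600)

   # Inside each child notebook, the first cell would set a unique pool, e.g.:
   #   sc.setLocalProperty("spark.scheduler.pool", "etl_step_1")
   with ThreadPoolExecutor(max_workers=2) as ec:
       futures = [ec.submit(run_notebook, p) for p in paths]
       results = [f.result() for f in futures]

   print(results)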

Kaniz
Community Manager

Hi @Dipak Bachhav, we haven't heard from you since @Hubert Dudek's last response, and I was checking back to see whether you have a resolution yet.

If you have found a solution, please share it with the community, as it can be helpful to others. Otherwise, we will respond with more details and try to help.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

sujai_sparks
New Contributor III

Hi @Dipak Bachhav, not sure if you have fixed the issue, but here are a few things you can check:

  1. Is the path "/Users/mynotebook" correct? Maybe you are missing the dot at the beginning (e.g. "./Users/mynotebook" for a relative path).
  2. Run the notebook on its own with dbutils.notebook.run (see the snippet below) and see if there are any errors.
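
For example (note that dbutils.notebook.run also requires a timeout argument; the 60 seconds here is just an illustrative value):

   # Run the child notebook by itself to surface its real error message.
   result = dbutils.notebook.run("/Users/mynotebook", 60)
   print(result)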
