my_community2
New Contributor III

Running notebooks on Databricks in Azure has been blowing up all over since the morning of Apr 5 (MST). Was there another bad deployment at Databricks? This really needs to stop.

We are running premium Databricks on Azure and calling notebooks from ADF.

10.2 (includes Apache Spark 3.2.0, Scala 2.12)

High-concurrency cluster with passthrough enabled. The config has not changed in the last two months, and there were no issues in the last month until this morning.



Aashita
Contributor III

I would recommend enabling retry on the activity in ADF, as below. Hope this works.

If this doesn't work, can you reply with the entire stack trace?
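The attached screenshot did not survive, so as a sketch: retry for a Databricks Notebook activity is configured in the activity's `policy` block of the pipeline JSON. The activity name, retry count, and interval below are example values, not recommendations.

```json
{
  "name": "RunNightlyNotebook",
  "type": "DatabricksNotebook",
  "policy": {
    "timeout": "0.01:00:00",
    "retry": 3,
    "retryIntervalInSeconds": 120
  }
}
```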


my_community2
New Contributor III

It will not help, but thank you for your input.

60% of our daily load failed twice today. This is not an isolated incident: we are talking about 40-50 concurrent Databricks notebooks that worked fine just yesterday and over the last seven days.

We will open an MS ticket tomorrow, but this is not looking good. I have to say that an incident of this scale is an eye-opener for us and a big red flag going forward with Databricks.

Multiple cells report the same error as shown below:

Failure starting repl. Try detaching and re-attaching the notebook.

java.lang.Exception: Python shell failed to start in 80 seconds
    at com.databricks.backend.daemon.driver.PythonDriverLocal.startPython(PythonDriverLocal.scala:1005)
    at com.databricks.backend.daemon.driver.PythonDriverLocal.init(PythonDriverLocal.scala:1012)
    at com.databricks.backend.daemon.driver.PythonDriverLocal.<init>(PythonDriverLocal.scala:1020)
    at com.databricks.backend.daemon.driver.PythonDriverWrapper.instantiateDriver(DriverWrapper.scala:775)
    at com.databricks.backend.daemon.driver.DriverWrapper.setupRepl(DriverWrapper.scala:335)
    at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:224)
    at java.lang.Thread.run(Thread.java:748)

Hubert-Dudek
Esteemed Contributor III

The cluster is a physical machine. If you use it for a long time, something can change; underneath, it is just a Linux machine. I don't think it is Databricks' fault.

By the way, the Azure Data Factory notebook activity runs on ... Databricks, so in that respect ADF is just a Microsoft UI in front of Databricks.

The error can also be caused by third-party libraries or Docker images that fail to download.

I also had a machine that worked fine for months and then failed. After analysis, it turned out Linux couldn't download the pyodbc library from Microsoft because their repo had stopped working. I fixed it by downloading the library to DBFS and using it from there.
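A minimal sketch of that workaround, assuming a DBFS-backed wheel cache (the `WHEEL_DIR` path is an assumption; adjust it to your workspace layout): pre-stage the wheel once, then have the cluster install from the local copy so startup no longer depends on the upstream repo being reachable.

```shell
# Assumed cache location on DBFS (visible to the driver as /dbfs/...).
WHEEL_DIR="${WHEEL_DIR:-/dbfs/FileStore/wheels}"
mkdir -p "$WHEEL_DIR"

# One-time, from an environment that can still reach the repo:
#   pip download pyodbc --dest "$WHEEL_DIR"

# Then, in the cluster init script, install from the local copy only,
# never touching the remote index:
#   pip install --no-index --find-links "$WHEEL_DIR" pyodbc

echo "wheel cache: $WHEEL_DIR"
```

`--no-index` makes the failure mode explicit: if the wheel is missing from the cache, the install fails immediately instead of hanging on an unreachable repo.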

Kaniz
Community Manager

Hi @Maciej G, were you able to resolve this?

my_community2
New Contributor III

Not yet, but I opened a ticket with MS. Grinding through tech support...

@Maciej G, thank you for letting us know. Please let us know when you get the resolution too.

my_community2
New Contributor III

Not much luck with resolving the issue. We were told that "Failure starting repl. Try detaching and re-attaching the notebook." means the Python shell is not able to attach.

We received two recommendations from MS:

  1. Identify a compute cluster for nightly jobs, manually attach all notebooks to that cluster, and keep it that way. An archaic approach, in my opinion, for several reasons: what if you have tens of notebooks and need to change the cluster? Think about dev, test, and prod environments and the constant modifications (by the way, we do not allow manual changes in test and dev; everything is CI/CD). Our notebooks are kicked off from ADF, and the cluster compute is assigned dynamically.
  2. We had a small two-node cluster with auto-scale on, and it was suggested that auto-scale might be causing the issue. We disabled it without success and just confirmed that the error is still happening. We have been seeing this in our DEV, TEST, and PROD since April.

Atanu
Esteemed Contributor

I would suggest checking the cluster health from Ganglia and checking whether the cluster is under load; this will give you more info. If this is an interactive cluster, we suggest moving your jobs to a job cluster. If your use case requires an interactive cluster, we suggest restarting it regularly, and also check whether any GC failures are happening on the cluster. You may ask MSFT support to open a Databricks collaboration case, and we can look into it further. Sorry for the inconvenience caused, @Maciej G.

Prabakar
Esteemed Contributor III

@Maciej G, try using the below init script to increase the repl timeout.

--------------------------------------

#!/bin/bash

cat > /databricks/common/conf/set_repl_timeout.conf << EOL
{
  databricks.daemon.driver.launchTimeout = 150
}
EOL

--------------------------------------
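One hedged way to deploy an init script like this: stage it as a file, upload it to DBFS with the Databricks CLI, then reference it under the cluster's Init Scripts settings and restart the cluster. The `dbfs:/` destination path below is an assumption; pick one that fits your workspace.

```shell
# Stage the init script locally. The heredoc is quoted ('EOS') so nothing
# inside is expanded while writing the file.
cat > set_repl_timeout.sh <<'EOS'
#!/bin/bash
mkdir -p /databricks/common/conf
cat > /databricks/common/conf/set_repl_timeout.conf << EOL
{
  databricks.daemon.driver.launchTimeout = 150
}
EOL
EOS
chmod +x set_repl_timeout.sh

# Upload with a configured Databricks CLI (assumed destination path),
# then add it under the cluster's Init Scripts tab and restart:
#   databricks fs cp set_repl_timeout.sh dbfs:/databricks/init-scripts/set_repl_timeout.sh

echo "staged: set_repl_timeout.sh"
```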

Prabakar
Esteemed Contributor III

Also, check the event logs of the cluster to understand if there was any failure while acquiring additional nodes.
