my_community2
New Contributor III

Running notebooks on Databricks in Azure has been blowing up all over since the morning of Apr 5 (MST). Was there another bad deployment at Databricks? This really needs to stop.

We are running premium Databricks on Azure and calling notebooks from ADF.

10.2 (includes Apache Spark 3.2.0, Scala 2.12)

High-concurrency cluster with passthrough enabled. The config has not changed in the last two months, and there were no issues in the last month until this morning.



Aashita
Contributor III

I would recommend enabling retry on the activity in ADF, as below. Hope this works.

If this doesn't work, can you reply with the entire stack trace?
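The attached screenshot did not survive, so as a sketch: retry for a Databricks Notebook activity is configured in the activity's `policy` block of the pipeline JSON. The activity name, retry count, and interval below are example values, not recommendations.

```json
{
  "name": "RunNightlyNotebook",
  "type": "DatabricksNotebook",
  "policy": {
    "timeout": "0.01:00:00",
    "retry": 3,
    "retryIntervalInSeconds": 120
  }
}
```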


my_community2
New Contributor III

It will not help, but thank you for your input.

60% of our daily load failed twice today. This is not an isolated incident: we are talking about 40-50 concurrent Databricks notebooks that worked fine just yesterday and over the last seven days.

We will open an MS ticket tomorrow, but this is not looking good. I have to say that an incident of this scale is an eye-opener for us and a big red flag going forward with Databricks.

Multiple cells report the same error as shown below:

Failure starting repl. Try detaching and re-attaching the notebook.

java.lang.Exception: Python shell failed to start in 80 seconds
    at com.databricks.backend.daemon.driver.PythonDriverLocal.startPython(PythonDriverLocal.scala:1005)
    at com.databricks.backend.daemon.driver.PythonDriverLocal.init(PythonDriverLocal.scala:1012)
    at com.databricks.backend.daemon.driver.PythonDriverLocal.<init>(PythonDriverLocal.scala:1020)
    at com.databricks.backend.daemon.driver.PythonDriverWrapper.instantiateDriver(DriverWrapper.scala:775)
    at com.databricks.backend.daemon.driver.DriverWrapper.setupRepl(DriverWrapper.scala:335)
    at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:224)
    at java.lang.Thread.run(Thread.java:748)

Hubert-Dudek
Esteemed Contributor III

The cluster is a physical machine. If you use it for a long time, something can change; underneath, it is just a Linux machine. I don't think it is Databricks' fault.

By the way, the Azure Data Factory notebook activity runs on ... Databricks, so in that respect ADF is just a Microsoft UI in front of Databricks.

The error can also be caused by third-party libraries or Docker images that fail to download.

I also had a machine that worked fine for months and then failed. After analysis, it turned out Linux couldn't download the pyodbc library from Microsoft because their repo had stopped working. I fixed it by downloading the library to DBFS and using it from there.
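A minimal sketch of that workaround, assuming a DBFS-backed wheel cache (the `WHEEL_DIR` path is an assumption; adjust it to your workspace layout): pre-stage the wheel once, then have the cluster install from the local copy so startup no longer depends on the upstream repo being reachable.

```shell
# Assumed cache location on DBFS (visible to the driver as /dbfs/...).
WHEEL_DIR="${WHEEL_DIR:-/dbfs/FileStore/wheels}"
mkdir -p "$WHEEL_DIR"

# One-time, from an environment that can still reach the repo:
#   pip download pyodbc --dest "$WHEEL_DIR"

# Then, in the cluster init script, install from the local copy only,
# never touching the remote index:
#   pip install --no-index --find-links "$WHEEL_DIR" pyodbc

echo "wheel cache: $WHEEL_DIR"
```

`--no-index` makes the failure mode explicit: if the wheel is missing from the cache, the install fails immediately instead of hanging on an unreachable repo.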

Kaniz
Community Manager

Hi @Maciej G, were you able to resolve this?

my_community2
New Contributor III

Not yet, but I opened a ticket with MS. Grinding through tech support...

@Maciej G, thank you for letting us know. Please let us know when you get the resolution too.

my_community2
New Contributor III

Not much luck with resolving the issue. We were told that "Failure starting repl. Try detaching and re-attaching the notebook." means the Python shell is not able to attach.

We received two recommendations from MS:

  1. Identify a compute cluster for nightly jobs, manually attach all notebooks to that cluster, and keep it that way. An archaic approach, in my opinion, for several reasons: what if you have tens of notebooks and need to change the cluster? Think about dev, test, and prod environments and the constant modifications (by the way, we do not allow manual changes in test and dev; everything is CI/CD). Our notebooks are kicked off from ADF, and the cluster compute is assigned dynamically.
  2. We had a small two-node cluster with auto-scale on, and it was suggested that auto-scale might be causing the issue. We disabled it without success and just confirmed that the error is still happening. We have been seeing this in our DEV, TEST, and PROD since April.

Atanu
Esteemed Contributor

I would suggest checking the cluster health from Ganglia and checking whether the cluster is under load; this will give you more info. If this is an interactive cluster, we suggest moving your jobs to a job cluster. If your use case requires an interactive cluster, we suggest restarting it regularly, and also check whether any GC failures are happening on the cluster. You may ask MSFT support to open a Databricks collaboration case, and we can look into it further. Sorry for the inconvenience caused, @Maciej G.

Prabakar
Esteemed Contributor III

@Maciej G, try using the below init script to increase the repl timeout.

--------------------------------------

#!/bin/bash

cat > /databricks/common/conf/set_repl_timeout.conf << EOL
{
  databricks.daemon.driver.launchTimeout = 150
}
EOL

--------------------------------------
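One hedged way to deploy an init script like this: stage it as a file, upload it to DBFS with the Databricks CLI, then reference it under the cluster's Init Scripts settings and restart the cluster. The `dbfs:/` destination path below is an assumption; pick one that fits your workspace.

```shell
# Stage the init script locally. The heredoc is quoted ('EOS') so nothing
# inside is expanded while writing the file.
cat > set_repl_timeout.sh <<'EOS'
#!/bin/bash
mkdir -p /databricks/common/conf
cat > /databricks/common/conf/set_repl_timeout.conf << EOL
{
  databricks.daemon.driver.launchTimeout = 150
}
EOL
EOS
chmod +x set_repl_timeout.sh

# Upload with a configured Databricks CLI (assumed destination path),
# then add it under the cluster's Init Scripts tab and restart:
#   databricks fs cp set_repl_timeout.sh dbfs:/databricks/init-scripts/set_repl_timeout.sh

echo "staged: set_repl_timeout.sh"
```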

Prabakar
Esteemed Contributor III

Also, check the event logs of the cluster to understand if there was any failure while acquiring additional nodes.
