04-05-2022 08:36 PM
Notebooks running on Databricks in Azure have been blowing up all over since the morning of Apr 5 (MST). Was there another bad deployment at Databricks? This really needs to stop.
We are running premium Databricks on Azure and calling notebooks from ADF.
10.2 (includes Apache Spark 3.2.0, Scala 2.12)
High-concurrency cluster with passthrough enabled. The config has not changed in the last two months, and we had no issues in the last month until this morning.
04-05-2022 09:39 PM
04-05-2022 10:05 PM
It will not help but thank you for your input.
60% of our daily load failed twice today, so this is not an isolated incident. We are talking about 40-50 concurrent Databricks notebooks that worked fine just yesterday and over the last seven days.
We will open an MS ticket tomorrow, but this is not looking good. I have to say that an incident of this scale is an eye-opener for us and a big red flag for going forward with Databricks.
Multiple cells report the same error as shown below:
Failure starting repl. Try detaching and re-attaching the notebook.
java.lang.Exception: Python shell failed to start in 80 seconds
at com.databricks.backend.daemon.driver.PythonDriverLocal.startPython(PythonDriverLocal.scala:1005)
at com.databricks.backend.daemon.driver.PythonDriverLocal.init(PythonDriverLocal.scala:1012)
at com.databricks.backend.daemon.driver.PythonDriverLocal.<init>(PythonDriverLocal.scala:1020)
at com.databricks.backend.daemon.driver.PythonDriverWrapper.instantiateDriver(DriverWrapper.scala:775)
at com.databricks.backend.daemon.driver.DriverWrapper.setupRepl(DriverWrapper.scala:335)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:224)
at java.lang.Thread.run(Thread.java:748)
04-06-2022 02:50 AM
It is a physical cluster machine. If you use it for a long time, something can change; it is just a Linux machine. I don't think it is Databricks' fault.
By the way, Azure Data Factory runs on ... Databricks, so it is essentially a Microsoft UI on top of Databricks.
The error can also be caused by third-party libraries or Docker images that fail to download.
I also had a machine that worked fine for months and then failed. After analysis, it turned out Linux couldn't download the pyodbc library from Microsoft because their repo stopped working. I fixed it by downloading the library to DBFS and installing it from there.
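A minimal sketch of that workaround as a cluster init script, assuming the wheel has already been staged on DBFS (the wheel path and filename below are hypothetical placeholders):

```shell
#!/bin/bash
# Sketch: install pyodbc from a wheel staged on DBFS instead of pulling it
# from the Microsoft repo at cluster start. The wheel path is a hypothetical
# example -- stage your own copy first. On a Databricks cluster you would
# typically call /databricks/python/bin/pip; plain `pip` is used here to keep
# the function environment-agnostic.
install_from_dbfs() {
  local wheel="$1"
  if [ -f "$wheel" ]; then
    pip install "$wheel"
  else
    echo "wheel not found: $wheel" >&2
    return 1
  fi
}

# e.g. install_from_dbfs /dbfs/FileStore/wheels/pyodbc-<version>.whl
```

Attached as a cluster-scoped init script, this removes the dependency on the remote repo being reachable when the cluster starts.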
04-07-2022 11:28 AM
Not yet, but I opened a ticket with MS. Grinding through the tech support...
04-11-2022 09:31 AM
Not much luck resolving the issue. We were told that "Failure starting repl. Try detaching and re-attaching the notebook." means the Python shell is not able to attach.
We received two recommendations from MS:
05-12-2022 10:43 PM
I would suggest checking the cluster health in Ganglia and whether the cluster is under load; that will give you more information. If this is an interactive cluster, we suggest moving your jobs to a job cluster. If your use case requires an interactive cluster, we suggest restarting it regularly, and also checking whether any GC failures are happening on the cluster. You may ask MSFT support to open a Databricks collaboration case and we can look into it further. Sorry for the inconvenience caused @Maciej G
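The cluster events mentioned above can also be pulled over the REST API rather than the UI; a sketch, assuming the Clusters Events endpoint (POST /api/2.0/clusters/events) and a configured host, token, and cluster ID (all placeholders here):

```shell
#!/bin/bash
# Sketch: build the request body for the Clusters Events endpoint, then query
# it with curl. Host, token, and cluster ID are placeholders to fill in.
build_events_payload() {
  printf '{"cluster_id": "%s", "limit": %d}' "$1" "$2"
}

# Example call (requires DATABRICKS_HOST, DATABRICKS_TOKEN, CLUSTER_ID):
# curl -s -X POST "$DATABRICKS_HOST/api/2.0/clusters/events" \
#   -H "Authorization: Bearer $DATABRICKS_TOKEN" \
#   -d "$(build_events_payload "$CLUSTER_ID" 50)"
```

The response lists recent events such as node acquisition failures and resizes, which is useful when diagnosing intermittent startup errors like this one.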
05-18-2022 03:22 AM
@Maciej G try using the init script below to increase the REPL timeout.
--------------------------------------
#!/bin/bash
# Raise the Python REPL launch timeout (the error above shows the default of
# 80 seconds being exceeded) to 150 seconds.
cat > /databricks/common/conf/set_repl_timeout.conf << EOL
{
  databricks.daemon.driver.launchTimeout = 150
}
EOL
--------------------------------------
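For what it's worth, the same conf can be generated with the timeout as a parameter, so the script can be reused while tuning the value (a sketch; the conf path is the one from the script above):

```shell
#!/bin/bash
# Sketch: write the launchTimeout conf with the timeout value as a parameter.
write_repl_timeout_conf() {
  local target="$1" timeout_s="$2"
  printf '{\n  databricks.daemon.driver.launchTimeout = %s\n}\n' "$timeout_s" > "$target"
}

# In the real init script:
# write_repl_timeout_conf /databricks/common/conf/set_repl_timeout.conf 150
```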
05-18-2022 03:24 AM
Also, check the event logs of the cluster to understand if there was any failure while acquiring additional nodes.