04-05-2022 08:36 PM
Running notebooks on Databricks in Azure has been blowing up all over since the morning of Apr 5 (MST). Was there another bad deployment at Databricks? This really needs to stop.
We are running premium Databricks on Azure and calling notebooks from ADF.
Runtime: 10.2 (includes Apache Spark 3.2.0, Scala 2.12)
High Concurrency cluster with passthrough enabled. The config has not changed in the last two months, and there were no issues in the last month until this morning.
04-05-2022 09:39 PM
I would recommend enabling retry on the notebook activity in ADF, as below. Hope this works.
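For reference, this is roughly what the retry settings look like in the activity's policy block in the pipeline JSON; the activity name and values here are just placeholders, tune them to your load:
--------------------------------------
{
    "name": "RunNotebook",
    "type": "DatabricksNotebook",
    "policy": {
        "retry": 3,
        "retryIntervalInSeconds": 300
    }
}
--------------------------------------
You can set the same values on the General tab of the activity in the ADF UI.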
If this doesn't work, can you reply with the entire stack trace?
04-05-2022 10:05 PM
It will not help, but thank you for your input.
60% of our daily load failed twice today. This is not an isolated incident. We are talking about 40-50 concurrent Databricks notebooks that worked fine just yesterday and over the last seven days.
We will open an MS ticket tomorrow, but this is not looking good. I have to say that an incident of this scale is an eye opener for us and a big red flag going forward with Databricks.
Multiple cells report the same error as shown below:
Failure starting repl. Try detaching and re-attaching the notebook.
java.lang.Exception: Python shell failed to start in 80 seconds
at com.databricks.backend.daemon.driver.PythonDriverLocal.startPython(PythonDriverLocal.scala:1005)
at com.databricks.backend.daemon.driver.PythonDriverLocal.init(PythonDriverLocal.scala:1012)
at com.databricks.backend.daemon.driver.PythonDriverLocal.<init>(PythonDriverLocal.scala:1020)
at com.databricks.backend.daemon.driver.PythonDriverWrapper.instantiateDriver(DriverWrapper.scala:775)
at com.databricks.backend.daemon.driver.DriverWrapper.setupRepl(DriverWrapper.scala:335)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:224)
at java.lang.Thread.run(Thread.java:748)
04-06-2022 02:50 AM
A cluster is a set of physical machines; if you use it for a long time, something underneath can change. It is just a Linux machine in the end. I don't think it is Databricks' fault.
By the way, Azure Data Factory is running on ... Databricks, so it is just a Microsoft UI for Databricks.
The error can also be caused by third-party libraries or Docker images that fail to download.
I also had a machine which worked fine for months and then failed. After analysis, it turned out that Linux couldn't download the pyodbc library from Microsoft because their repo stopped working. I fixed it by downloading the package to DBFS and installing it from there.
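For anyone hitting the same thing, here is a minimal sketch of that workaround as a cluster init script; the DBFS path and wheel filename are placeholders for wherever you stage the package:
--------------------------------------
#!/bin/bash
# Install pyodbc from a wheel staged on DBFS instead of the remote repo.
# The path below is a placeholder; upload the wheel there first.
/databricks/python/bin/pip install /dbfs/FileStore/libs/pyodbc-4.0.32-cp38-cp38-linux_x86_64.whl
--------------------------------------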
04-07-2022 11:28 AM
Not yet, but I opened a ticket with MS. Grinding through tech support...
04-11-2022 09:31 AM
Not much luck with resolving the issue. We were told that "Failure starting repl. Try detaching and re-attaching the notebook." means the Python shell was not able to attach.
We received two recommendations from MS:
- Designate a compute cluster for nightly jobs, manually attach all notebooks to that cluster, and keep it that way. An archaic approach in my opinion, for several reasons: what if you have tens of notebooks and change the cluster? Think about dev, test, and prod environments and the constant modifications. By the way, we do not allow manual changes in test and dev; everything is CI/CD. Our notebooks are kicked off from ADF, and the cluster compute is assigned dynamically.
- We had a small two-node cluster with auto-scale on, and it was suggested that the auto-scale might be causing the issue. We disabled it (see the sketch after this list) without success; we just confirmed the error is still happening. We have been seeing this in our DEV, TEST, and PROD environments since April.
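For context, "disabling auto-scale" here just means replacing the autoscale block with a fixed worker count in the cluster spec. A sketch against the Clusters API JSON, with example values:
--------------------------------------
{
    "cluster_name": "nightly-jobs",
    "spark_version": "10.2.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2
}
--------------------------------------
Before the change, the num_workers line was an "autoscale": { "min_workers": ..., "max_workers": ... } block.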
05-12-2022 10:43 PM
I would suggest checking the cluster health in Ganglia to see whether the cluster is under load; this will give you more info. If this is an interactive cluster, we suggest moving your jobs to a job cluster (a sketch of how that looks from ADF is below). If your use case requires an interactive cluster, we suggest a regular off-on of the cluster, and also check whether any GC failures are happening on the cluster. You may ask MSFT support to open a Databricks collaboration case and we can look into it further. Sorry for the inconvenience caused @Maciej G
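To illustrate the job-cluster suggestion: in ADF this means pointing the Databricks linked service at a new cluster per run instead of an existingClusterId. A sketch, with the workspace URL, token reference, and node type as placeholders:
--------------------------------------
{
    "name": "DatabricksJobCluster",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://adb-1234567890123456.7.azuredatabricks.net",
            "accessToken": "<reference a Key Vault secret here>",
            "newClusterVersion": "10.2.x-scala2.12",
            "newClusterNumOfWorker": "2",
            "newClusterNodeType": "Standard_DS3_v2"
        }
    }
}
--------------------------------------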
05-18-2022 03:22 AM
@Maciej G try using the init script below to increase the REPL launch timeout.
--------------------------------------
#!/bin/bash
# Raise the driver's REPL launch timeout from the default 80s to 150s.
mkdir -p /databricks/common/conf
cat > /databricks/common/conf/set_repl_timeout.conf << EOL
{
databricks.daemon.driver.launchTimeout = 150
}
EOL
--------------------------------------
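The 80 seconds in your stack trace ("Python shell failed to start in 80 seconds") is that default, so this gives the shell more time to come up. Stage the script on DBFS, reference it under the cluster's Advanced Options > Init Scripts, then restart the cluster. One way to stage it, with the DBFS path as an example:
--------------------------------------
# Copy the script to DBFS with the Databricks CLI; the target path is an example.
databricks fs cp set_repl_timeout.sh dbfs:/databricks/init-scripts/set_repl_timeout.sh
--------------------------------------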
05-18-2022 03:24 AM
Also, check the event logs of the cluster to understand if there was any failure while acquiring additional nodes.
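If you want to pull those events programmatically rather than from the UI, you can hit the Clusters API; workspace URL, token, and cluster ID are placeholders:
--------------------------------------
# Query recent cluster events via the REST API (placeholders throughout).
curl -s -X POST "https://adb-<workspace-id>.azuredatabricks.net/api/2.0/clusters/events" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -d '{ "cluster_id": "<cluster-id>", "limit": 50 }'
--------------------------------------
Look for resize or node-acquisition failures around the times your notebooks failed.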