Driver restarts and job dies after 10-20 hours (Structured Streaming)

Ossian
New Contributor

I am running a Java/jar Structured Streaming job on a single-node cluster (Databricks Runtime 8.3). The job contains a single query which reads records from multiple Azure Event Hubs using Spark's Kafka source and writes results to an MSSQL database on Azure through a foreachBatch sink. When not started from a previous checkpoint, the job usually runs for 12-14 hours before stopping. After it has died, the job successfully restarts from a checkpoint and has been observed to run for 19 hours before stopping again. Why does my job halt?
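For context, reading an Event Hub through its Kafka-compatible endpoint uses source options like the following. This is only a sketch: the namespace, Event Hub name, and connection string are placeholders, not values from my job.

```properties
# Hypothetical Kafka source options for one Event Hub (placeholders in <angle brackets>)
kafka.bootstrap.servers=<NAMESPACE>.servicebus.windows.net:9093
subscribe=<eventhub-name>
kafka.security.protocol=SASL_SSL
kafka.sasl.mechanism=PLAIN
kafka.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="<EVENT_HUBS_CONNECTION_STRING>";
```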

According to the logs, before the job stops the driver seems to first spontaneously attempt a restart (I can't find anything in the logs before this point which would indicate that a restart was going to happen):

21/07/21 01:30:12 INFO StaticConf$: DB_HOME: /databricks
21/07/21 01:30:14 INFO DriverDaemon$: Current JVM Version 1.8.0_282
21/07/21 01:30:14 INFO DriverDaemon$: ========== driver starting up ==========
21/07/21 01:30:14 INFO DriverDaemon$: Java: Azul Systems, Inc. 1.8.0_282
21/07/21 01:30:14 INFO DriverDaemon$: OS: Linux/amd64 5.4.0-1051-azure
21/07/21 01:30:14 INFO DriverDaemon$: CWD: /databricks/driver
...

This is then followed by logging of the spark configuration and so on.

After a few seconds we then get

21/07/21 01:30:47 ERROR RShell: Failed to evaluate init script at path '/local_disk0/tmp/_CleanRShell.r1768656956904795188resource.r'
21/07/21 01:30:47 ERROR RDriverLocal: Starting R interpreter failed.
com.databricks.backend.daemon.driver.RDriverLocal$RProcessFatalException
        at com.databricks.backend.daemon.driver.RShell.$anonfun$new$1(RShell.scala:57)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at com.databricks.backend.daemon.driver.RShell.<init>(RShell.scala:50)
        at com.databricks.backend.daemon.driver.RDriverLocal.init(RDriverLocal.scala:631)
        at com.databricks.backend.daemon.driver.RDriverLocal$.init(RDriverLocal.scala:1078)
        at com.databricks.backend.daemon.driver.RDriverWrapper.instantiateDriver(DriverWrapper.scala:819)
        at com.databricks.backend.daemon.driver.DriverWrapper.setupRepl(DriverWrapper.scala:331)
        at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:220)
        at java.lang.Thread.run(Thread.java:748)

followed by some more related errors.

According to the logs, the job jar is successfully re-attached to the restarted cluster; however, the job does not restart, and the cluster then shuts down after my configured cluster "Terminate after" setting of one hour. I have not configured automatic job restarts, to make this problem easier to debug.
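Once debugging is done, automatic restarts could be re-enabled with the job's retry settings, roughly like this (a sketch of Databricks Jobs retry fields; the interval value is an arbitrary example, not something I currently use):

```json
{
  "max_retries": -1,
  "min_retry_interval_millis": 300000,
  "retry_on_timeout": true
}
```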

If it is of interest, the database used as a sink is very slow and is the main bottleneck of the query, causing each micro-batch of ~100,000 records to take progressively longer: the first takes ~5 minutes, and after 20 micro-batches each takes around 2 hours. I hope this is not causing problems, since I don't want the job to die if such database bottlenecking happens at some point in production.

Also, the job needs to store about 50 GB of state (more than the RAM available on my cluster). To avoid running out of memory I am using the Databricks RocksDB state store.
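For reference, enabling the RocksDB state store on Databricks is a single Spark conf setting (a sketch of the documented provider class for this runtime era; verify the class name against your DBR version's docs):

```ini
# Spark configuration: use the Databricks RocksDB-backed state store provider
spark.sql.streaming.stateStore.providerClass com.databricks.sql.streaming.state.RocksDBStateStoreProvider
```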

Thanks

1 REPLY

Aviral-Bhardwaj
Esteemed Contributor III

It seems that when your nodes come back up, the cluster looks for the init script and fails to find it. You could use reserved instances instead of spot instances for this workload, though that will increase your overall cost.

Alternatively, you can attach the jar through the dependent-library option in Databricks; that may resolve the situation.
