I am running a Java/JAR Structured Streaming job on a single-node cluster (Databricks Runtime 8.3). The job contains a single query which reads records from multiple Azure Event Hubs using Spark's Kafka source and outputs results to an MSSQL database on Azure using a foreachBatch sink. When not started from any previous checkpoint, the job usually runs for 12-14 hours before stopping. After it has died, the job successfully restarts from the checkpoint, and has then been observed to run for 19 hours before stopping again. Why does my job halt?
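For context, the query is shaped roughly like this (a minimal sketch of my setup; the namespace, hub names, credentials, and checkpoint path are placeholders, and writeToSqlServer stands in for my actual batch-write logic, sketched further down):

import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class StreamingJobSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().getOrCreate();

        // Event Hubs exposes a Kafka-compatible endpoint on port 9093.
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
                .option("subscribe", "hub-a,hub-b") // placeholder hub names
                .option("kafka.security.protocol", "SASL_SSL")
                .option("kafka.sasl.mechanism", "PLAIN")
                .option("kafka.sasl.jaas.config",
                        "org.apache.kafka.common.security.plain.PlainLoginModule required "
                        + "username=\"$ConnectionString\" password=\"<connection-string>\";")
                .load();

        StreamingQuery query = events.writeStream()
                // Per-batch JDBC write to the Azure SQL sink (sketched further down).
                .foreachBatch((VoidFunction2<Dataset<Row>, Long>)
                        (batch, batchId) -> SqlSink.writeToSqlServer(batch, batchId))
                .option("checkpointLocation", "/mnt/checkpoints/my-job") // placeholder path
                .start();

        query.awaitTermination();
    }
}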
According to the logs, before the job stops, the driver first seems to spontaneously attempt a restart (I can't find anything earlier in the logs that would indicate a restart was coming):
21/07/21 01:30:12 INFO StaticConf$: DB_HOME: /databricks
21/07/21 01:30:14 INFO DriverDaemon$: Current JVM Version 1.8.0_282
21/07/21 01:30:14 INFO DriverDaemon$: ========== driver starting up ==========
21/07/21 01:30:14 INFO DriverDaemon$: Java: Azul Systems, Inc. 1.8.0_282
21/07/21 01:30:14 INFO DriverDaemon$: OS: Linux/amd64 5.4.0-1051-azure
21/07/21 01:30:14 INFO DriverDaemon$: CWD: /databricks/driver
...
This is then followed by logging of the Spark configuration and so on.
After a few seconds we then get:
21/07/21 01:30:47 ERROR RShell: Failed to evaluate init script at path '/local_disk0/tmp/_CleanRShell.r1768656956904795188resource.r'
21/07/21 01:30:47 ERROR RDriverLocal: Starting R interpreter failed.
com.databricks.backend.daemon.driver.RDriverLocal$RProcessFatalException
at com.databricks.backend.daemon.driver.RShell.$anonfun$new$1(RShell.scala:57)
at scala.collection.immutable.List.foreach(List.scala:392)
at com.databricks.backend.daemon.driver.RShell.<init>(RShell.scala:50)
at com.databricks.backend.daemon.driver.RDriverLocal.init(RDriverLocal.scala:631)
at com.databricks.backend.daemon.driver.RDriverLocal$.init(RDriverLocal.scala:1078)
at com.databricks.backend.daemon.driver.RDriverWrapper.instantiateDriver(DriverWrapper.scala:819)
at com.databricks.backend.daemon.driver.DriverWrapper.setupRepl(DriverWrapper.scala:331)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:220)
at java.lang.Thread.run(Thread.java:748)
followed by some more related errors.
According to the logs, the job JAR is successfully re-attached to the restarted cluster; however, the job itself does not seem to restart, since the cluster then shuts down once my configured "Terminate after" setting of one hour elapses. To debug this problem more easily, I have not configured automatic job restarts.
In case it is relevant: the database used as a sink is very slow and is the main bottleneck of the query. Each micro-batch of ~100,000 records takes progressively longer, starting at ~5 minutes and reaching around 2 hours after 20 micro-batches. I hope this is not the cause, since I don't want the job to die if such database bottlenecking happens at some point in production.
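For completeness, the batch write inside foreachBatch is essentially a plain JDBC append, roughly like this (the URL, table, and credentials are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public final class SqlSink {
    // Placeholder connection details for the Azure SQL database.
    private static final String JDBC_URL =
            "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>";

    static void writeToSqlServer(Dataset<Row> batch, long batchId) {
        // batchId is available for idempotent-write bookkeeping if needed.
        batch.write()
             .format("jdbc")
             .option("url", JDBC_URL)
             .option("dbtable", "dbo.results") // placeholder target table
             .option("user", "<user>")
             .option("password", "<password>")
             .mode(SaveMode.Append)
             .save();
    }
}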
Also, the job needs to keep about 50 GB of data as state, which is more than the RAM available on my cluster. To avoid running out of memory, I am using the Databricks RocksDB state store (enabled as shown below).
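This is the setting I use to enable it, applied on the SparkSession before starting the query (to my knowledge this is the provider class Databricks documents for these runtimes, but double-check against your DBR version's docs):

// Set before the streaming query is started.
spark.conf().set(
        "spark.sql.streaming.stateStore.providerClass",
        "com.databricks.sql.streaming.state.RocksDBStateStoreProvider");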
Thanks