
Driver restarts and job dies after 10-20 hours (Structured Streaming)

Ossian
New Contributor

I am running a Java (JAR) Structured Streaming job on a single-node cluster (Databricks Runtime 8.3). The job contains a single query that reads records from multiple Azure Event Hubs through Spark's Kafka source and writes results to an MS SQL database on Azure using a foreachBatch sink. When started without a previous checkpoint, the job usually runs for 12-14 hours before stopping. After it dies, the job restarts successfully from its checkpoint and has been observed to run for 19 hours before stopping again. Why does my job halt?
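For readers unfamiliar with this kind of setup, here is a minimal sketch of how the Kafka source options for Azure Event Hubs are commonly assembled. It is not the code from this job: the class name `EventHubsKafkaOptions`, the `build` helper, and all parameter values are hypothetical, and the `kafkashaded` prefix assumes the shaded Kafka client that ships with Databricks runtimes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative helper (hypothetical, not from the original job). */
public final class EventHubsKafkaOptions {

    private EventHubsKafkaOptions() {}

    /**
     * Builds options for Spark's "kafka" source pointed at the Event Hubs Kafka endpoint.
     *
     * @param namespace            Event Hubs namespace name (placeholder)
     * @param connectionString     Event Hubs connection string (placeholder)
     * @param eventHubs            event hub names, exposed as Kafka topics
     * @param maxOffsetsPerTrigger upper bound on records read per micro-batch
     */
    public static Map<String, String> build(String namespace, String connectionString,
                                            List<String> eventHubs, long maxOffsetsPerTrigger) {
        Map<String, String> options = new LinkedHashMap<>();
        // Event Hubs serves the Kafka protocol on port 9093 over SASL_SSL.
        options.put("kafka.bootstrap.servers", namespace + ".servicebus.windows.net:9093");
        options.put("kafka.security.protocol", "SASL_SSL");
        options.put("kafka.sasl.mechanism", "PLAIN");
        // Assumption: on Databricks the Kafka client classes are shaded under "kafkashaded".
        options.put("kafka.sasl.jaas.config",
            "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
            + "username=\"$ConnectionString\" password=\"" + connectionString + "\";");
        // A single query can subscribe to several event hubs as a comma-separated topic list.
        options.put("subscribe", String.join(",", eventHubs));
        // Caps the number of records pulled into each micro-batch.
        options.put("maxOffsetsPerTrigger", Long.toString(maxOffsetsPerTrigger));
        return options;
    }
}
```

The resulting map could then be handed to the reader with `spark.readStream().format("kafka").options(...).load()`. Whether capping `maxOffsetsPerTrigger` has any bearing on the restarts described here is not established by the post.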

According to the logs, before the job stops the driver seems to first spontaneously attempt a restart (I can't find anything in the logs before this point which would indicate that a restart was going to happen):

21/07/21 01:30:12 INFO StaticConf$: DB_HOME: /databricks
21/07/21 01:30:14 INFO DriverDaemon$: Current JVM Version 1.8.0_282
21/07/21 01:30:14 INFO DriverDaemon$: ========== driver starting up ==========
21/07/21 01:30:14 INFO DriverDaemon$: Java: Azul Systems, Inc. 1.8.0_282
21/07/21 01:30:14 INFO DriverDaemon$: OS: Linux/amd64 5.4.0-1051-azure
21/07/21 01:30:14 INFO DriverDaemon$: CWD: /databricks/driver
...

This is then followed by logging of the spark configuration and so on.

After a few seconds we then get

21/07/21 01:30:47 ERROR RShell: Failed to evaluate init script at path '/local_disk0/tmp/_CleanRShell.r1768656956904795188resource.r'
21/07/21 01:30:47 ERROR RDriverLocal: Starting R interpreter failed.
com.databricks.backend.daemon.driver.RDriverLocal$RProcessFatalException
        at com.databricks.backend.daemon.driver.RShell.$anonfun$new$1(RShell.scala:57)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at com.databricks.backend.daemon.driver.RShell.<init>(RShell.scala:50)
        at com.databricks.backend.daemon.driver.RDriverLocal.init(RDriverLocal.scala:631)
        at com.databricks.backend.daemon.driver.RDriverLocal$.init(RDriverLocal.scala:1078)
        at com.databricks.backend.daemon.driver.RDriverWrapper.instantiateDriver(DriverWrapper.scala:819)
        at com.databricks.backend.daemon.driver.DriverWrapper.setupRepl(DriverWrapper.scala:331)
        at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:220)
        at java.lang.Thread.run(Thread.java:748)

followed by some more related errors.

According to the logs, the job JAR is successfully re-attached to the restarted cluster. However, the job does not seem to restart, since the cluster then goes down after my configured cluster "Terminate after" setting of one hour. I have deliberately not configured automatic job restarts, so that this problem is easier to debug.

If it is of interest, the database used as a sink is very slow and is the main bottleneck of the query. Each micro-batch of ~100,000 records takes progressively longer, starting at ~5 min and reaching around 2 hours after 20 micro-batches. I hope this is not causing the problem, since I don't want the job to die if such database bottlenecking happens at some point in production.

Also, the job needs to store about 50 GB of data as state (more than the RAM available on my cluster). To not run out of memory I am using the Databricks RocksDB state store.
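For reference, the RocksDB state store on Databricks is typically enabled through a Spark configuration entry on the cluster, along these lines. This is a sketch of the documented setting, not a copy of this cluster's configuration, which the post does not show.

```
spark.sql.streaming.stateStore.providerClass com.databricks.sql.streaming.state.RocksDBStateStoreProvider
```

With this provider, state is kept in RocksDB on local disk rather than entirely on the JVM heap, which is what allows state larger than the available RAM.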

Thanks

1 REPLY

Aviral-Bhardwaj
Esteemed Contributor III

It seems that when nodes are added to your cluster, they look for the init script and fail to run it. You could use reserved instances instead of spot instances for this workload.

This will increase your overall cost.

Alternatively, you could use the dependent library option in Databricks; that may also resolve the problem.

AviralBhardwaj
