We have a job running on a job cluster where sometimes the driver dies:
> The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.
But the metrics don't point to an obvious cause.
In the logs, we see:
25/07/29 08:29:31 WARN DBRDebuggerEventReporter: Driver/10.1.2.10 got terminated abnormally due to FORCE_KILL.
I have tried to determine the cause of this, i.e. which process led to the driver being force-killed, but there are thousands of ERROR-level log entries to sift through.
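To triage the volume, a rough sketch like the one below can group the ERROR lines by logger name and show which ones dominate (the file name is just a placeholder for the exported driver log4j output, and the regex assumes the `yy/MM/dd HH:mm:ss LEVEL Logger: message` format seen in the snippets below):

```python
import re
from collections import Counter

# Count ERROR-level lines per logger in the exported driver log4j output.
# "driver_log4j.txt" is a placeholder path for the exported driver log.
error_line = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ERROR (\S+?):")

counts = Counter()
with open("driver_log4j.txt") as log:
    for line in log:
        match = error_line.match(line)
        if match:
            counts[match.group(1)] += 1

# Show the 20 most frequent error sources.
for logger, n in counts.most_common(20):
    print(f"{n:6d}  {logger}")
```

The recurring entries below are the ones that stand out.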
This one happens a lot, and since the Jupyter kernel runs on the driver, I find it suspicious:
25/07/29 08:20:54 ERROR JupyterKernelListener$: Unexpected error reading from iopub-poller.
java.lang.NullPointerException: Cannot invoke "com.databricks.backend.daemon.driver.OutputWidgetManager.getActiveWidgetBuffer()" because the return value of "com.databricks.backend.daemon.driver.JupyterKernelListener.outputWidgetManager()" is null
The following tells me that at some point all jobs were cancelled:
25/07/29 08:21:29 ERROR MicroBatchExecution: Nonfatal data source exception caught for query with queryActive=true: java.util.concurrent.ExecutionException: org.apache.spark.SparkException: [SPARK_JOB_CANCELLED] Job 776 cancelled as part of cancellation of all jobs SQLSTATE: HY008
This one happens a lot, but it's not clear why:
25/07/29 08:21:52 ERROR EnsureRequirementsDP: Physical plan has logical operators in subqueries
This one happens a lot too:
25/07/29 08:28:51 ERROR ReplAwareSparkDataSourceListener: Unexpected exception when attempting to handle SparkListenerSQLExecutionEnd event. Please report this error, along with the following stacktrace, on https://github.com/mlflow/mlflow/issues:
java.lang.RuntimeException: Unable to find method with name tableVersion of object with class com.databricks.sql.transaction.tahoe.files.TahoeBatchFileIndex. Available methods: [...]
All of this noise makes it difficult to pick out the actual errors in the job run.