Driver terminated abnormally due to FORCE_KILL

Malthe — Tue, 29 Jul 2025 11:46:59 GMT

We have a job running on a job cluster where sometimes the driver dies:

> The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.

But the metrics don't suggest an explanation for this situation.

In the logs, we see:
25/07/29 08:29:31 WARN DBRDebuggerEventReporter: Driver/10.1.2.10 got terminated abnormally due to FORCE_KILL.

I have tried to determine the cause of this, that is, which process led to the driver being force-killed. There are thousands of ERROR-level log entries to sift through.

This one happens a lot, and since the Jupyter kernel runs on the driver, I think it's suspicious:

25/07/29 08:20:54 ERROR JupyterKernelListener$: Unexpected error reading from iopub-poller.
java.lang.NullPointerException: Cannot invoke "com.databricks.backend.daemon.driver.OutputWidgetManager.getActiveWidgetBuffer()" because the return value of "com.databricks.backend.daemon.driver.JupyterKernelListener.outputWidgetManager()" is null

This tells me that at some point all jobs have been cancelled:

25/07/29 08:21:29 ERROR MicroBatchExecution: Nonfatal data source exception caught for query with queryActive=true: java.util.concurrent.ExecutionException: org.apache.spark.SparkException: [SPARK_JOB_CANCELLED] Job 776 cancelled as part of cancellation of all jobs SQLSTATE: HY008

This one happens a lot but it's not clear why:

25/07/29 08:21:52 ERROR EnsureRequirementsDP: Physical plan has logical operators in subqueries

This one happens a lot too:

25/07/29 08:28:51 ERROR ReplAwareSparkDataSourceListener: Unexpected exception when attempting to handle SparkListenerSQLExecutionEnd event. Please report this error, along with the following stacktrace, on https://github.com/mlflow/mlflow/issues:
java.lang.RuntimeException: Unable to find method with name tableVersion of object with class com.databricks.sql.transaction.tahoe.files.TahoeBatchFileIndex. Available methods: [...]

All of these make it difficult to understand the actual errors in the job run.

Re: Driver terminated abnormally due to FORCE_KILL

cgrant — Sat, 09 Aug 2025 18:38:21 GMT

That error is usually related to driver load. Try upsizing the driver one size and see if it still happens.

Otherwise, for troubleshooting: driver problems are surfaced in the cluster's event log as events such as DRIVER_NOT_RESPONDING and DRIVER_UNAVAILABLE. You can also open the metrics page, filter for just the driver node in the dropdown, and look for high CPU or memory utilization.
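
Since the driver log has thousands of ERROR lines, it may also help to tally them per logger and locate the FORCE_KILL entry before reading anything in detail. Below is a minimal Python sketch, assuming the log lines follow the "yy/MM/dd HH:mm:ss LEVEL Logger: message" layout seen in the excerpts above. The names LOG_LINE, summarize, and SAMPLE are made up for illustration and are not part of any Databricks tooling.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpts in this thread:
#   "25/07/29 08:29:31 WARN DBRDebuggerEventReporter: message"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<logger>[^:\s]+): (?P<msg>.*)$"
)

def summarize(lines, kill_marker="FORCE_KILL"):
    """Count ERROR lines per logger and collect entries mentioning kill_marker."""
    error_counts = Counter()
    kill_events = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # stack-trace continuation lines have no timestamp prefix
        if m["level"] == "ERROR":
            error_counts[m["logger"]] += 1
        if kill_marker in m["msg"]:
            kill_events.append((m["ts"], m["logger"]))
    return error_counts, kill_events

# Sample input: lines copied from the original post.
SAMPLE = [
    "25/07/29 08:20:54 ERROR JupyterKernelListener$: Unexpected error reading from iopub-poller.",
    'java.lang.NullPointerException: Cannot invoke "com.databricks.backend.daemon.driver.OutputWidgetManager.getActiveWidgetBuffer()" because the return value of "com.databricks.backend.daemon.driver.JupyterKernelListener.outputWidgetManager()" is null',
    "25/07/29 08:21:52 ERROR EnsureRequirementsDP: Physical plan has logical operators in subqueries",
    "25/07/29 08:29:31 WARN DBRDebuggerEventReporter: Driver/10.1.2.10 got terminated abnormally due to FORCE_KILL.",
]

error_counts, kill_events = summarize(SAMPLE)
for logger, count in error_counts.most_common():
    print(f"{count:6d}  {logger}")
for ts, logger in kill_events:
    print(f"kill reported at {ts} by {logger}")
```

On a real driver log you would pass the file's lines instead of SAMPLE. Loggers that produce the same ERROR count on runs where the driver survives are probably noise, and the timestamp of the kill entry tells you which window of the driver metrics to look at.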
