We have a job running on a job cluster where sometimes the driver dies:
> The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.
But the metrics don't point to an obvious cause.
In the logs, we see:
25/07/29 08:29:31 WARN DBRDebuggerEventReporter: Driver/10.1.2.10 got terminated abnormally due to FORCE_KILL.
I have tried to determine the cause of this, i.e. which process led to the driver being force-killed, but there are thousands of ERROR-level log entries to sift through.
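To triage the volume, a rough sketch like the one below can group the ERROR lines by logger name and show which ones dominate (the file name is just a placeholder for the exported driver log4j output, and the regex assumes the `yy/MM/dd HH:mm:ss LEVEL Logger: message` format seen in the snippets below):

```python
import re
from collections import Counter

# Count ERROR-level lines per logger in the exported driver log4j output.
# "driver_log4j.txt" is a placeholder path for the exported driver log.
error_line = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ERROR (\S+?):")

counts = Counter()
with open("driver_log4j.txt") as log:
    for line in log:
        match = error_line.match(line)
        if match:
            counts[match.group(1)] += 1

# Show the 20 most frequent error sources.
for logger, n in counts.most_common(20):
    print(f"{n:6d}  {logger}")
```

The recurring entries below are the ones that stand out.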
This one happens a lot, and since the Jupyter kernel runs on the driver, I find it suspicious:
25/07/29 08:20:54 ERROR JupyterKernelListener$: Unexpected error reading from iopub-poller.
java.lang.NullPointerException: Cannot invoke "com.databricks.backend.daemon.driver.OutputWidgetManager.getActiveWidgetBuffer()" because the return value of "com.databricks.backend.daemon.driver.JupyterKernelListener.outputWidgetManager()" is null
The following tells me that at some point all jobs were cancelled:
25/07/29 08:21:29 ERROR MicroBatchExecution: Nonfatal data source exception caught for query with queryActive=true: java.util.concurrent.ExecutionException: org.apache.spark.SparkException: [SPARK_JOB_CANCELLED] Job 776 cancelled as part of cancellation of all jobs SQLSTATE: HY008
This one happens a lot, but it's not clear why:
25/07/29 08:21:52 ERROR EnsureRequirementsDP: Physical plan has logical operators in subqueries
This one happens a lot too:
25/07/29 08:28:51 ERROR ReplAwareSparkDataSourceListener: Unexpected exception when attempting to handle SparkListenerSQLExecutionEnd event. Please report this error, along with the following stacktrace, on https://github.com/mlflow/mlflow/issues:
java.lang.RuntimeException: Unable to find method with name tableVersion of object with class com.databricks.sql.transaction.tahoe.files.TahoeBatchFileIndex. Available methods: [...]
All of this noise makes it difficult to pick out the actual errors in the job run.