Data Engineering

Driver terminated abnormally due to FORCE_KILL

Malthe
Contributor

We have a job running on a job cluster where sometimes the driver dies:

The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.

But the metrics don't suggest an explanation for this situation.

In the logs, we see:
25/07/29 08:29:31 WARN DBRDebuggerEventReporter: Driver/10.1.2.10 got terminated abnormally due to FORCE_KILL.

I have tried to determine the cause of this, that is, which process led to the forced killing of the driver. There are thousands of ERROR-level log entries to sift through.

This one happens a lot and since the Jupyter kernel runs on the driver, I think it's suspicious:

25/07/29 08:20:54 ERROR JupyterKernelListener$: Unexpected error reading from iopub-poller.
java.lang.NullPointerException: Cannot invoke "com.databricks.backend.daemon.driver.OutputWidgetManager.getActiveWidgetBuffer()" because the return value of "com.databricks.backend.daemon.driver.JupyterKernelListener.outputWidgetManager()" is null

This tells me that at some point all jobs have been cancelled:

25/07/29 08:21:29 ERROR MicroBatchExecution: Nonfatal data source exception caught for query with queryActive=true: java.util.concurrent.ExecutionException: org.apache.spark.SparkException: [SPARK_JOB_CANCELLED] Job 776 cancelled as part of cancellation of all jobs SQLSTATE: HY008

This one happens a lot but it's not clear why:

25/07/29 08:21:52 ERROR EnsureRequirementsDP: Physical plan has logical operators in subqueries

This one happens a lot too:

25/07/29 08:28:51 ERROR ReplAwareSparkDataSourceListener: Unexpected exception when attempting to handle SparkListenerSQLExecutionEnd event. Please report this error, along with the following stacktrace, on https://github.com/mlflow/mlflow/issues:
java.lang.RuntimeException: Unable to find method with name tableVersion of object with class com.databricks.sql.transaction.tahoe.files.TahoeBatchFileIndex. Available methods: [...]

All of these make it difficult to understand the actual errors in the job run.
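One way to cut through the noise is to tally ERROR lines per logger and note when the FORCE_KILL marker appears. The following is a minimal sketch, assuming the driver log lines follow the `yy/MM/dd HH:mm:ss LEVEL Logger: message` shape seen in the excerpts above; the function name `summarize` and the idea of reading the log from a local file are illustrative assumptions, not part of any Databricks tooling.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpts above, e.g.
# "25/07/29 08:29:31 WARN DBRDebuggerEventReporter: Driver/... FORCE_KILL."
LOG_LINE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+) ([^:\s]+): (.*)$"
)


def summarize(lines, marker="FORCE_KILL"):
    """Count ERROR lines per logger and collect timestamps of marker lines.

    Lines that do not match the pattern (stack-trace continuations,
    blank lines) are skipped.
    """
    error_counts = Counter()
    marker_times = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue
        timestamp, level, logger, message = match.groups()
        if level == "ERROR":
            error_counts[logger] += 1
        if marker in message:
            marker_times.append(timestamp)
    return error_counts, marker_times
```

Usage, assuming the driver log has been downloaded to a hypothetical local file `driver.log`: `counts, kills = summarize(open("driver.log"))`, then `counts.most_common(10)` lists the loggers producing the most ERROR lines, and `kills` gives the times to look around for the actual cause.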

1 Reply

cgrant
Databricks Employee

That error is usually related to driver load. Try upsizing the driver one size and see if it still happens.

Otherwise, for troubleshooting, driver problems are surfaced in the cluster's event log as events such as DRIVER_NOT_RESPONDING and DRIVER_UNAVAILABLE. You can also check the metrics page and filter for just the driver node in the dropdown, looking for high CPU or memory utilization.
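If you would rather pull those driver events programmatically than browse the UI, something like the following may work. It is a rough sketch, assuming the Clusters API events endpoint is reachable at `/api/2.1/clusters/events` in your workspace (the version segment may differ) and that you authenticate with a personal access token; the helper names `build_events_request` and `fetch_driver_events` are made up for illustration.

```python
import json
import urllib.request

# Event types mentioned above; other driver-related types may exist.
DRIVER_EVENT_TYPES = ["DRIVER_NOT_RESPONDING", "DRIVER_UNAVAILABLE"]


def build_events_request(host, token, cluster_id,
                         event_types=DRIVER_EVENT_TYPES, limit=50):
    """Build a POST request asking for recent driver events of one cluster.

    Assumption: the endpoint path and payload fields follow the
    Clusters API "events" call; adjust them to match your workspace.
    """
    payload = {
        "cluster_id": cluster_id,
        "event_types": list(event_types),
        "order": "DESC",  # newest events first
        "limit": limit,
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/clusters/events",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_driver_events(host, token, cluster_id):
    """Send the request and return the list of events (possibly empty)."""
    request = build_events_request(host, token, cluster_id)
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("events", [])
```

Comparing the timestamps of any returned events with the FORCE_KILL log line should show whether the driver was already unresponsive before it was killed.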
