3 weeks ago
Hi everyone,
I'm encountering an intermittent issue when launching a Databricks pipeline cluster. The error message is:
com.databricks.pipelines.common.errors.deployment.DeploymentException: Failed to launch pipeline cluster xxxx-xxxxxx-ofgxxxxx: Attempt to launch cluster with invalid arguments. databricks_error_message: Spark failed to start: Driver unresponsive. Possible reasons: library conflicts, incorrect metastore configuration, and i... This error is likely due to a misconfiguration in the pipeline. Check the pipeline cluster configuration and associated cluster policy.
Interestingly, the pipeline fails 3 times with this error and then succeeds on the 4th attempt without any manual intervention.
Has anyone experienced something similar? What could be causing this "Spark failed to start: Driver unresponsive" error? Are there known configurations or best practices I should check to prevent this from happening in the future?
Any insights would be greatly appreciated.
Thanks in advance!
3 weeks ago
If you check on the Driver logs of the cluster specifically for the log4j do you see any additional error?
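For example, if you download the log4j files from the cluster's Driver logs tab, you can do a quick scan for error lines with a rough Python sketch like the one below (the "driver_logs" folder name and the keyword list are just placeholders to adjust):

import pathlib

KEYWORDS = ("ERROR", "Driver unresponsive", "metastore")

# Scan every downloaded log4j file and print the lines that mention a keyword.
for log_file in sorted(pathlib.Path("driver_logs").glob("log4j*")):
    with open(log_file, errors="replace") as fh:
        for line_no, line in enumerate(fh, start=1):
            if any(keyword in line for keyword in KEYWORDS):
                print(f"{log_file.name}:{line_no}: {line.rstrip()}")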
3 weeks ago
Hey @Walter_C,
I'm not sure what to look for, so could you please let me know what I should check in the log4j file?
Here are some of the log4j lines containing the word "Error" for this particular run:
25/06/23 20:41:22 INFO ErrorEventListener: Configured monitoring unexpected Java module errors with a throttling threshold of 5 unique events per 10 minutes
pipelines.cdc.enableGatewayErrorPropagation=true
SaferConf(spark.databricks.sqlservice.history.isWisErrorDiagnosticInfoTruncationEnabled,true,1748374558,241220010234395,4,None),
SaferConf(spark.sql.functions.remoteHttpClient.retryOn400TimeoutError,true,1730827108,241011231241015,3,None), SaferConf(spark.databricks.cloudFiles.recordEventChanges,false,1742410140,250318071048458,4,None),
SaferConf(spark.sql.legacy.codingErrorAction,true,1739382214,250203190102217,6,None)
Let me know if you need the full log4j file.
Thanks for the help. Really appreciated!
3 weeks ago
I have personally witnessed these kinds of issues.
As far as I have seen, these failures usually happen because the driver node is unavailable or unresponsive: you may have hit maximum CPU or memory usage, your cache utilisation may have peaked, and there can be many more reasons.
To avoid such issues, I always schedule my workflows or jobs with a good retry count and spread the retries more than 5 minutes apart (see the sketch below).
Also, if the same issue occurs every time you run your code, then you should optimize your code to read and write the data efficiently.
This has worked like magic for me most of the time. At the end of the day, it all comes down to the availability of the compute, which can never be 100 percent.
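To make the retry spacing concrete: on a job task this maps to the max_retries and min_retry_interval_millis fields of the task settings in the Jobs API 2.1, and if I remember correctly there is a matching "Retries" option when editing the task in the Workflows UI. A minimal sketch of the relevant fragment, where the task key and the values are just example placeholders:

# Retry-related fields of a single task in a job definition (Jobs API 2.1).
# Only the retry fields are the point here; the rest of the task definition
# (notebook/pipeline reference, cluster spec, etc.) is elided.
task_settings = {
    "task_key": "run_my_pipeline",               # placeholder task name
    "max_retries": 3,                            # retry up to 3 times on failure
    "min_retry_interval_millis": 5 * 60 * 1000,  # at least 5 minutes between the start of a failed run and its retry
    "retry_on_timeout": True,                    # also retry when the run times out
    # ... rest of the task definition goes here
}

print(task_settings)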
3 weeks ago
@Gopichand_G,
I'll take your suggestions; they look promising. Could you please let me know how to set a 5-minute delay after each retry?
Thanks for the help!