Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Spark Failed to start: Driver unresponsive

mkwparth
New Contributor III

Hi everyone,

I'm encountering an intermittent issue when launching a Databricks pipeline cluster. The error message is:

com.databricks.pipelines.common.errors.deployment.DeploymentException: Failed to launch pipeline cluster xxxx-xxxxxx-ofgxxxxx: Attempt to launch cluster with invalid arguments. databricks_error_message: Spark failed to start: Driver unresponsive. Possible reasons: library conflicts, incorrect metastore configuration, and i... This error is likely due to a misconfiguration in the pipeline. Check the pipeline cluster configuration and associated cluster policy.

Interestingly, the pipeline fails 3 times with this error and then succeeds on the 4th attempt without any manual intervention.
Has anyone experienced something similar? What could be causing this "Spark failed to start: Driver unresponsive" error? Are there known configurations or best practices I should check to prevent this from happening in the future?

Any insights would be greatly appreciated.
Thanks in advance!

4 REPLIES

Walter_C
Databricks Employee

If you check the driver logs of the cluster, specifically the log4j output, do you see any additional errors?

 

mkwparth
New Contributor III

Hey @Walter_C,

I don't have much idea of what to look for, so could you please tell me what I should look for in the log4j file?

Here are some of the log4j lines containing the word "Error" for this particular run:
25/06/23 20:41:22 INFO ErrorEventListener: Configured monitoring unexpected Java module errors with a throttling threshold of 5 unique events per 10 minutes
pipelines.cdc.enableGatewayErrorPropagation=true
SaferConf(spark.databricks.sqlservice.history.isWisErrorDiagnosticInfoTruncationEnabled,true,1748374558,241220010234395,4,None),

SaferConf(spark.sql.functions.remoteHttpClient.retryOn400TimeoutError,true,1730827108,241011231241015,3,None), SaferConf(spark.databricks.cloudFiles.recordEventChanges,false,1742410140,250318071048458,4,None),

SaferConf(spark.sql.legacy.codingErrorAction,true,1739382214,250203190102217,6,None)

let me know if you need full log4j file.
Thanks for the help. Really appreciated!

Gopichand_G
New Contributor II

I have personally witnessed these kinds of issues.

In my experience, these failures usually happen because the driver node is unavailable or unresponsive: you may have hit maximum CPU or memory usage, your cache utilization may have peaked, or there could be several other reasons.

 

To avoid such issues, I always schedule my workflows or jobs with a good retry count and spread the retries more than 5 minutes apart.
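As an illustration, retries and the delay between them can be set per task when creating a job through the Databricks Jobs API. A minimal sketch of the relevant task fields (the task key `my_pipeline_task` is a hypothetical name; `min_retry_interval_millis` is in milliseconds, so 300000 is 5 minutes):

```json
{
  "tasks": [
    {
      "task_key": "my_pipeline_task",
      "max_retries": 3,
      "min_retry_interval_millis": 300000,
      "retry_on_timeout": true
    }
  ]
}
```

The same settings are also available in the Jobs UI under the task's retry policy.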

 

Also, if the same issue occurs every time you run your code, then you should optimize your code to read and write data more efficiently.

This has worked like magic for me most of the time. At the end of the day, it all comes down to compute availability, which can never be 100 percent.

 

mkwparth
New Contributor III

@Gopichand_G ,

I'll take your suggestions; they look promising. Could you please let me know how to set a 5-minute delay between retries?

Thanks for the help!
