3 weeks ago
Hi everyone,
I'm encountering an intermittent issue when launching a Databricks pipeline cluster. The error message is:
com.databricks.pipelines.common.errors.deployment.DeploymentException: Failed to launch pipeline cluster xxxx-xxxxxx-ofgxxxxx: Attempt to launch cluster with invalid arguments. databricks_error_message: Spark failed to start: Driver unresponsive. Possible reasons: library conflicts, incorrect metastore configuration, and i... This error is likely due to a misconfiguration in the pipeline. Check the pipeline cluster configuration and associated cluster policy.
Interestingly, the pipeline fails 3 times with this error and then succeeds on the 4th attempt without any manual intervention.
Has anyone experienced something similar? What could be causing this "Spark failed to start: Driver unresponsive" error? Are there known configurations or best practices I should check to prevent this from happening in the future?
Any insights would be greatly appreciated.
Thanks in advance!
3 weeks ago
If you check on the Driver logs of the cluster specifically for the log4j do you see any additional error?
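For example, if you download the log4j files from the cluster's Driver logs tab, you can do a quick scan for error lines with a rough Python sketch like the one below (the "driver_logs" folder name and the keyword list are just placeholders to adjust):

import pathlib

KEYWORDS = ("ERROR", "Driver unresponsive", "metastore")

# Scan every downloaded log4j file and print the lines that mention a keyword.
for log_file in sorted(pathlib.Path("driver_logs").glob("log4j*")):
    with open(log_file, errors="replace") as fh:
        for line_no, line in enumerate(fh, start=1):
            if any(keyword in line for keyword in KEYWORDS):
                print(f"{log_file.name}:{line_no}: {line.rstrip()}")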
3 weeks ago
Hey @Walter_C,
I'm not sure what to look for, so could you please let me know what I should check in the log4j file?
Here are some of the log4j lines containing the word "Error" for this particular run:
25/06/23 20:41:22 INFO ErrorEventListener: Configured monitoring unexpected Java module errors with a throttling threshold of 5 unique events per 10 minutes
pipelines.cdc.enableGatewayErrorPropagation=true
SaferConf(spark.databricks.sqlservice.history.isWisErrorDiagnosticInfoTruncationEnabled,true,1748374558,241220010234395,4,None),
SaferConf(spark.sql.functions.remoteHttpClient.retryOn400TimeoutError,true,1730827108,241011231241015,3,None), SaferConf(spark.databricks.cloudFiles.recordEventChanges,false,1742410140,250318071048458,4,None),
SaferConf(spark.sql.legacy.codingErrorAction,true,1739382214,250203190102217,6,None)
Let me know if you need the full log4j file.
Thanks for the help. Really appreciated!
3 weeks ago
I have personally witnessed these kinds of issues.
As far as I have seen, these failures usually happen because the driver node is unavailable or unresponsive: you may have hit maximum CPU or memory usage, your cache utilisation may have peaked, and there can be many more reasons.
To avoid such issues, I always schedule my workflows or jobs with a good retry count and spread the retries more than 5 minutes apart (see the sketch below).
Also, if the same issue occurs every time you run your code, then you should optimize your code to read and write the data efficiently.
This has worked like magic for me most of the time. At the end of the day, it all comes down to the availability of the compute, which can never be 100 percent.
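To make the retry spacing concrete: on a job task this maps to the max_retries and min_retry_interval_millis fields of the task settings in the Jobs API 2.1, and if I remember correctly there is a matching "Retries" option when editing the task in the Workflows UI. A minimal sketch of the relevant fragment, where the task key and the values are just example placeholders:

# Retry-related fields of a single task in a job definition (Jobs API 2.1).
# Only the retry fields are the point here; the rest of the task definition
# (notebook/pipeline reference, cluster spec, etc.) is elided.
task_settings = {
    "task_key": "run_my_pipeline",               # placeholder task name
    "max_retries": 3,                            # retry up to 3 times on failure
    "min_retry_interval_millis": 5 * 60 * 1000,  # at least 5 minutes between the start of a failed run and its retry
    "retry_on_timeout": True,                    # also retry when the run times out
    # ... rest of the task definition goes here
}

print(task_settings)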
3 weeks ago
@Gopichand_G,
I'll take your suggestions; they look promising. Could you please let me know how to set a 5-minute delay after each retry?
Thanks for the help!