Data Engineering

Intermittent secret resolution error service fault in GCP

cmditch
New Contributor II

We are experiencing the error below in GCP when starting a cluster (both manually and in jobs). It's causing our ETL and other production jobs to fail multiple times a week. It's intermittent, but each failure requires manual intervention to retry the scheduled jobs.

 

run failed with error message Unexpected failure while waiting for the cluster (0817-041248-m827uwd4) to be ready: Cluster 0817-041248-m827uwd4 is in unexpected state Terminating: SECRET_RESOLUTION_ERROR(SERVICE_FAULT): databricks_error_message:Cannot fetch secrets referred in the Spark Environment Variables due to internal error.

 

 

2 REPLIES

Kaniz
Community Manager

Hi @cmditch. Based on the error message, the issue seems to be related to the resolution of secrets referred to in the Spark Environment Variables.

This error indicates that there is an internal error when fetching these secrets.

Here are a few potential causes and solutions:

1. **Permissions**: The Service Account (SA) used by your Databricks workspace might not have the correct permissions to access the secrets. Ensure the SA has the necessary permissions, such as 'Compute Storage Admin', 'Databricks Service IAM Role for Workspace', and 'Kubernetes Engine Admin'.

2. **Secret Existence**: The secrets referred to in the Spark Environment Variables might not exist or have been deleted. Verify that these secrets exist and are correctly configured.

3. **Internal Connectivity Issues**: Internal network connectivity issues might prevent the secrets from being fetched. Check your network configuration and ensure that there is proper connectivity.

Unfortunately, providing a more specific solution without more detailed logs or information about your setup is challenging. If the problem persists, consider contacting Databricks support by filing a support ticket here.
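To make the check in point 2 concrete, here is a minimal Python sketch. It assumes the environment variables reference secrets with the `{{secrets/<scope>/<key>}}` syntax; the variable names, scope, and keys below are hypothetical. The `existing` mapping is filled in by hand here; in practice you would build it from whatever secret listing you have access to (for example the Secrets REST API or CLI), which this sketch does not call.

```python
import re

# Secret references in Spark environment variables are assumed to look like
# {{secrets/<scope>/<key>}}.
SECRET_REF = re.compile(r"\{\{secrets/([^/{}]+)/([^/{}]+)\}\}")


def find_secret_refs(spark_env_vars):
    """Return a sorted list of (scope, key) pairs referenced in the env vars."""
    refs = set()
    for value in spark_env_vars.values():
        refs.update(SECRET_REF.findall(value))
    return sorted(refs)


def missing_secrets(refs, existing):
    """Return the refs not found in `existing` (a dict of scope -> set of keys)."""
    return [(scope, key) for scope, key in refs
            if key not in existing.get(scope, set())]


# Hypothetical cluster configuration and secret listing, for illustration only.
env = {
    "DB_PASSWORD": "{{secrets/etl/db-password}}",
    "API_TOKEN": "{{secrets/etl/api-token}}",
    "PLAIN_VALUE": "not-a-secret",
}
existing = {"etl": {"db-password"}}

refs = find_secret_refs(env)
missing = missing_secrets(refs, existing)
print(missing)
```

If `missing` is empty every time the cluster fails, the secrets themselves are unlikely to be the cause, which points back toward a service-side or connectivity problem.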

cmditch
New Contributor II

Thanks @Kaniz. Points 1 and 2 are confirmed fine. I wouldn't expect 3 to cause intermittent failures if it were a config issue, but perhaps there is another network-related issue that would be susceptible to intermittent failure.

The link you provided is for a training request. Is there another place where I can file a bug report?
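In the meantime, a workaround to avoid the manual retries could look roughly like the sketch below: retry only when the failure message carries the `SECRET_RESOLUTION_ERROR(SERVICE_FAULT)` termination reason, with exponential backoff between attempts. This is only a sketch under assumptions: `start_run` is a hypothetical callable that launches the job and raises `RuntimeError` with the failure message, and the attempt count and delays are arbitrary.

```python
import re
import time

# Matches a termination reason such as SECRET_RESOLUTION_ERROR(SERVICE_FAULT).
TERMINATION = re.compile(r"([A-Z_]+)\(([A-Z_]+)\)")


def parse_termination(message):
    """Return (code, type) from a failure message, or None if absent."""
    match = TERMINATION.search(message)
    return (match.group(1), match.group(2)) if match else None


def is_retryable(message):
    """Treat only the service-side secret resolution fault as transient."""
    return parse_termination(message) == ("SECRET_RESOLUTION_ERROR", "SERVICE_FAULT")


def run_with_retry(start_run, max_attempts=3, base_delay=30.0, sleep=time.sleep):
    """Call start_run, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return start_run()
        except RuntimeError as err:
            if attempt == max_attempts or not is_retryable(str(err)):
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Demo: a fake run that fails twice with the error from this thread.
ERROR = (
    "Cluster 0817-041248-m827uwd4 is in unexpected state Terminating: "
    "SECRET_RESOLUTION_ERROR(SERVICE_FAULT): databricks_error_message:Cannot "
    "fetch secrets referred in the Spark Environment Variables due to internal error."
)
calls = []
delays = []


def fake_run():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError(ERROR)
    return "SUCCESS"


result = run_with_retry(fake_run, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demo instant; a real wrapper would use the default `time.sleep`. Any other failure is re-raised immediately so genuine configuration errors are not masked.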
