Data Engineering

Cluster occasionally fails to launch

elikvar
New Contributor III

I have a daily running notebook that occasionally fails with the error:

"Run result unavailable: job failed with error message

Unexpected failure while waiting for the cluster Some((xxxxxxxxxxxxxxx) )to be readySome(: Cluster xxxxxxxxxxxxxxxx is in unexpected state Terminated: CONTAINER_LAUNCH_FAILURE(SERVICE_FAULT)instance_id:ixxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx,databricks_error_message:Failed to launch spark container on instance i-xxxxxxxxxxxxxxxxxxxx. Exception: Unexpected internal error, please contact Databricks support.)."

From the event log I get this message: ("Cluster terminated. Reason: Container launch failure. An unexpected error was encountered while launching containers on worker instances for the cluster. Please retry and contact Databricks if the problem persists.")

The error usually occurs every 4-5 days and there are no job logs. What is also strange is that the run duration is still the same as if the notebook did run. Anyone ever run into this issue before?

9 REPLIES

Anonymous
Not applicable

@Eli Kvarfordt​ :

This error message indicates that the Spark containers failed to launch on the worker instances for the cluster, which can happen for a number of reasons, including issues with the underlying infrastructure or configuration issues. Here are some steps you can take to troubleshoot and resolve the issue:

  1. Check the status of the worker instances in your cluster. Make sure that they are all up and running and that there are no issues with the underlying infrastructure. You can also check the logs of the worker instances to see if there are any errors or issues that might be causing the problem.
  2. Check the configuration of your cluster. Make sure that the configuration is correct and that there are no errors or inconsistencies. You can also try changing the configuration and see if that resolves the issue.
  3. Try restarting the cluster. Sometimes, restarting the cluster can resolve the issue. Make sure to save any important data before restarting the cluster.

It's also worth noting that if the run duration is still the same as if the notebook did run, it's possible that the notebook did actually run and complete, but the logs were not saved due to the error.
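Since the event log itself says to retry, one way to act on that from a script is a small retry wrapper with exponential backoff. This is only a rough sketch: `retry_with_backoff` and its parameters are hypothetical names, and `action` stands in for whatever call you use to start the cluster or trigger the run (it is assumed to raise `RuntimeError` on a launch failure). It is not a Databricks API.

```python
import time


def retry_with_backoff(action, max_attempts=3, base_delay_s=1.0, sleep=time.sleep):
    """Call action() until it succeeds or max_attempts is reached.

    `action` is a hypothetical zero-argument callable that raises
    RuntimeError when the launch fails. The wait doubles after each
    failed attempt: base_delay_s, 2 * base_delay_s, 4 * base_delay_s, ...
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except RuntimeError as err:
            last_error = err
            # Only wait if another attempt is still coming.
            if attempt < max_attempts:
                sleep(base_delay_s * 2 ** (attempt - 1))
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

You would pass in your own launch function, for example `retry_with_backoff(start_my_job, max_attempts=3, base_delay_s=60)`, where `start_my_job` is a placeholder for your code. The `sleep` parameter is injectable only so the helper can be exercised without actually waiting.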

elikvar
New Contributor III

Hi suteja, thanks for the response. Unfortunately, I've tried these already and everything looks normal. I have many other jobs that run with different configs, and a lot of them use the same config as this one, but for some reason only this one occasionally fails. Could the task the job is doing interfere with how the cluster behaves? If my understanding is correct, the cluster will autoscale to meet demand, so maybe something about the job is causing the cluster to provision resources in a strange way? However, if that is the case, I wonder why I don't have any Databricks logs at all. It doesn't even show that the first cell ran, which just sets variables.

Anonymous
Not applicable

@Eli Kvarfordt​ :

It's certainly possible that the task your job is performing is causing issues with the cluster. For example, if the job is using up a lot of resources or generating a lot of network traffic, it could be impacting the performance of the cluster or causing it to provision resources in unexpected ways.

One thing you could try is to monitor the cluster's resource usage while the job is running, and see if there are any spikes or unusual patterns that could be related to the failure. You can use the Databricks cluster metrics dashboard to monitor the cluster's CPU, memory, and network usage in real time.

As for the lack of logs, it's possible that the failure is happening too early in the job execution process for logs to be generated. If the first cell of your notebook isn't even running, it could be that the notebook itself is failing to launch or the cluster is terminating before it even gets to the first cell. In this case, it might be helpful to try running the notebook manually outside of the job scheduler to see if you can reproduce the issue and get more information on what's happening.

Kaniz
Community Manager

Hi @Eli Kvarfordt, we haven't heard from you since the last response from @Suteja Kanuri, and I was checking back to see if her suggestions helped you.

Otherwise, if you have a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

Anonymous
Not applicable

Hi @Eli Kvarfordt​ 

Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

elikvar
New Contributor III

Hi @Kaniz Fatma​ @Vidula Khanna​ @Suteja Kanuri​, I was not able to resolve the issue. I was monitoring it for a little while and it was behaving fine but it failed again today with the same issue. The only logs I have are from the Event log of the cluster shown below.

One thing I did notice is that the previous run, which triggers 7 hours earlier, had the same ADD_NODES_FAILED event but was eventually able to add the nodes and run the job. This makes me think there might be a race condition somewhere in the startup of the cluster? I'm not too sure how that sequence works. I also attached the first run's event logs.
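To compare the two runs more systematically, the event log entries could be tallied by type. The snippet below is a minimal sketch that assumes the events have already been exported as a list of dictionaries with `type` and `timestamp` keys; that shape, the helper names, and the sample data are all assumptions for illustration, not real logs from this cluster.

```python
from collections import Counter


def count_event_types(events):
    """Return how many times each event type appears in the log."""
    return Counter(e.get("type", "UNKNOWN") for e in events)


def failed_node_additions(events):
    """Return the ADD_NODES_FAILED events, oldest first."""
    failed = [e for e in events if e.get("type") == "ADD_NODES_FAILED"]
    return sorted(failed, key=lambda e: e.get("timestamp", 0))


# Made-up sample data (assumed shape, not taken from a real event log).
sample_events = [
    {"type": "STARTING", "timestamp": 1000},
    {"type": "ADD_NODES_FAILED", "timestamp": 3000},
    {"type": "ADD_NODES_FAILED", "timestamp": 2000},
    {"type": "TERMINATING", "timestamp": 4000},
]

counts = count_event_types(sample_events)
failures = failed_node_additions(sample_events)
```

Running this over the event logs of a successful run and a failed run would show whether ADD_NODES_FAILED appears in both, and how the timing differs before the termination.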

elikvar
New Contributor III

Here's the first run event log:

Sanm
New Contributor II

Hi @elikvar, did you find a solution for this error? I am also getting the same error, "failed to add 3 containers", on cluster create and cluster start. I get this error every time, and the cluster is terminated automatically.

Lebreton
New Contributor II

Hello,

Any update on this issue?

We have the same problem and no logs to investigate (even in DBFS when we enable logging):

Unexpected failure while waiting for the cluster (<id of our cluster>) to be ready: Cluster <id of our cluster> is in unexpected state Terminated: CONTAINER_LAUNCH_FAILURE(SERVICE_FAULT): instance_id:x-xxxxxxxx,databricks_error_message:Failed to launch spark container on instance x-xxxxxxx. Exception: Unexpected internal error, please contact Databricks support
