Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Cluster occasionally fails to launch

elikvar
New Contributor III

I have a daily running notebook that occasionally fails with the error:

"Run result unavailable: job failed with error message

Unexpected failure while waiting for the cluster Some((xxxxxxxxxxxxxxx) )to be readySome(: Cluster xxxxxxxxxxxxxxxx is in unexpected state Terminated: CONTAINER_LAUNCH_FAILURE(SERVICE_FAULT)instance_id:ixxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx,databricks_error_message:Failed to launch spark container on instance i-xxxxxxxxxxxxxxxxxxxx. Exception: Unexpected internal error, please contact Databricks support.)."

From the event log I get this message: ("Cluster terminated. Reason: Container launch failure. An unexpected error was encountered while launching containers on worker instances for the cluster. Please retry and contact Databricks if the problem persists.")

The error usually occurs every 4-5 days and there are no job logs. What is also strange is that the run duration is still the same as if the notebook did run. Anyone ever run into this issue before?

9 REPLIES

Anonymous
Not applicable

@Eli Kvarfordt:

This error message indicates that the Spark containers failed to launch on the worker instances for the cluster, which can happen for a number of reasons, including issues with the underlying infrastructure or configuration issues. Here are some steps you can take to troubleshoot and resolve the issue:

  1. Check the status of the worker instances in your cluster. Make sure that they are all up and running and that there are no issues with the underlying infrastructure. You can also check the logs of the worker instances to see if there are any errors or issues that might be causing the problem.
  2. Check the configuration of your cluster. Make sure that the configuration is correct and that there are no errors or inconsistencies. You can also try changing the configuration and see if that resolves the issue.
  3. Try restarting the cluster. Sometimes, restarting the cluster can resolve the issue. Make sure to save any important data before restarting the cluster.
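Since the error is flagged as a transient `SERVICE_FAULT`, another common mitigation is to let the job retry automatically instead of failing outright. A minimal sketch, assuming the Jobs API 2.1 task fields (`max_retries`, `min_retry_interval_millis`, `retry_on_timeout`); the task and notebook names here are hypothetical:

```python
# Sketch: apply automatic-retry settings to a Databricks Jobs API 2.1 task spec.
# The retry field names come from the Jobs API 2.1; task_key and notebook_path
# below are hypothetical placeholders.

def with_retries(task: dict, max_retries: int = 2, wait_ms: int = 60_000) -> dict:
    """Return a copy of a Jobs API task spec with retry settings applied."""
    retried = dict(task)
    retried["max_retries"] = max_retries            # retry up to N times on failure
    retried["min_retry_interval_millis"] = wait_ms  # pause between attempts
    retried["retry_on_timeout"] = False             # only retry genuine failures
    return retried

task = {
    "task_key": "daily_notebook",  # hypothetical task name
    "notebook_task": {"notebook_path": "/Repos/example/daily"},
}
print(with_retries(task)["max_retries"])  # -> 2
```

The same settings can also be set in the job's UI under the task's retry policy; the point is just that one retry usually papers over an occasional container-launch fault.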

It's also worth noting that if the run duration is still the same as if the notebook did run, it's possible that the notebook did actually run and complete, but the logs were not saved due to the error.

elikvar
New Contributor III

Hi Suteja, thanks for the response. Unfortunately I've tried these already and everything looks normal. I have many other jobs running on different configurations, and several use the same configuration this job does, but for some reason this one occasionally fails. Could the task the job is doing interfere with how the cluster behaves? If my understanding is correct, the cluster will autoscale to meet demand, so maybe something about the job is causing the cluster to provision resources in a strange way? If that were the case, though, I wonder why I don't have any Databricks logs at all; it doesn't even show that the first cell ran, which just sets variables.

Anonymous
Not applicable

@Eli Kvarfordt:

It's certainly possible that the task your job is performing is causing issues with the cluster. For example, if the job is using up a lot of resources or generating a lot of network traffic, it could be impacting the performance of the cluster or causing it to provision resources in unexpected ways.

One thing you could try is to monitor the cluster's resource usage while the job is running, and see if there are any spikes or unusual patterns that could be related to the failure. You can use the Databricks cluster metrics dashboard to monitor the cluster's CPU, memory, and network usage in real time.
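Beyond the metrics dashboard, the cluster's event stream can be pulled programmatically via the Clusters API (`POST /api/2.0/clusters/events`), which makes it easy to scan for the launch-related failures mentioned later in this thread. A minimal sketch; the host, token, and cluster id are placeholders you must supply, and the set of event types to flag is an assumption you may want to widen:

```python
# Sketch: fetch recent cluster events from the Clusters API 2.0 and keep
# only the ones that suggest a node/container problem. HOST, TOKEN, and
# the cluster id are placeholders.

FAILURE_EVENTS = {"ADD_NODES_FAILED", "TERMINATING", "NODES_LOST"}  # assumed set

def failure_events(events: list) -> list:
    """Keep only events whose type suggests a node/container failure."""
    return [e for e in events if e.get("type") in FAILURE_EVENTS]

def fetch_events(host: str, token: str, cluster_id: str) -> list:
    import requests  # imported lazily so the filtering helper stays dependency-free
    resp = requests.post(
        f"{host}/api/2.0/clusters/events",
        headers={"Authorization": f"Bearer {token}"},
        json={"cluster_id": cluster_id, "limit": 50},
    )
    resp.raise_for_status()
    return resp.json().get("events", [])

# usage (placeholders):
# for e in failure_events(fetch_events("https://<workspace>", "<token>", "<cluster-id>")):
#     print(e["timestamp"], e["type"], e.get("details"))
```

Running this on a schedule (or after each daily run) would at least give a timeline of failures even when no job logs are written.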

As for the lack of logs, it's possible that the failure is happening too early in the job execution process for logs to be generated. If the first cell of your notebook isn't even running, it could be that the notebook itself is failing to launch or the cluster is terminating before it even gets to the first cell. In this case, it might be helpful to try running the notebook manually outside of the job scheduler to see if you can reproduce the issue and get more information on what's happening.
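If reproducing it interactively is awkward, a one-off run can also be triggered outside the scheduler through the Jobs API (`POST /api/2.1/jobs/runs/submit`). A sketch of the request payload, assuming you want to reuse an existing cluster; the run name, notebook path, and cluster id are hypothetical:

```python
# Sketch: build a payload for a one-off notebook run via the Jobs API 2.1
# (POST /api/2.1/jobs/runs/submit). All names/ids below are placeholders.

def one_off_run_payload(notebook_path: str, cluster_id: str) -> dict:
    """Build a runs/submit payload that reuses an existing cluster."""
    return {
        "run_name": "manual-repro",  # hypothetical run name
        "tasks": [{
            "task_key": "repro",
            "existing_cluster_id": cluster_id,      # reuse the failing cluster
            "notebook_task": {"notebook_path": notebook_path},
        }],
    }

payload = one_off_run_payload("/Repos/example/daily", "0123-456789-abcde")
print(payload["tasks"][0]["task_key"])  # -> repro
```

Submitting this payload with your usual HTTP client lets you re-run the exact notebook on demand and watch the cluster events live, instead of waiting for the next scheduled failure.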

Anonymous
Not applicable

Hi @Eli Kvarfordt

Hope all is well! Just wanted to check in: were you able to resolve your issue, and if so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

elikvar
New Contributor III

Hi @Kaniz Fatma @Vidula Khanna @Suteja Kanuri, I was not able to resolve the issue. I monitored it for a while and it was behaving fine, but it failed again today with the same issue. The only logs I have are from the cluster's event log, shown below.

One thing I did notice is that the previous run, which triggers 7 hours earlier, had the same ADD_NODES_FAILED event but was eventually able to add the nodes and run the job. This makes me think there might be a race condition somewhere in the startup of the cluster? I'm not too sure how that sequence works. I also attached the first run's event logs.

elikvar
New Contributor III

Here's the first run's event log:

Sanm
New Contributor II

Hi @elikvar, did you find any solution for this error? I am also getting the same error, "failed to add 3 containers", on cluster create and cluster start. I get this error every time and the cluster is terminated automatically.

Lebreton
New Contributor II

Hello, any update on this issue?

We have the same problem and no logs to investigate (not even in DBFS when we activate logging):

Unexpected failure while waiting for the cluster (<id of our cluster>) to be ready: Cluster <id of our cluster> is in unexpected state Terminated: CONTAINER_LAUNCH_FAILURE(SERVICE_FAULT): instance_id:x-xxxxxxxx,databricks_error_message:Failed to launch spark container on instance x-xxxxxxx. Exception: Unexpected internal error, please contact Databricks support

Pavan578
New Contributor II

Cluster 'xxxxxxx' was terminated. Reason: WORKER_SETUP_FAILURE (SERVICE_FAULT). Parameters: databricks_error_message:DBFS Daemon is not reachable., gcp_error_message:Unable to reach the colocated DBFS Daemon.

Can anyone help me resolve this issue?
