03-13-2023 10:48 AM
I have a daily running notebook that occasionally fails with the error:
"Run result unavailable: job failed with error message
Unexpected failure while waiting for the cluster (xxxxxxxxxxxxxxx) to be ready: Cluster xxxxxxxxxxxxxxxx is in unexpected state Terminated: CONTAINER_LAUNCH_FAILURE(SERVICE_FAULT): instance_id: i-xxxxxxxxxxxxxxxxxxxx, databricks_error_message: Failed to launch spark container on instance i-xxxxxxxxxxxxxxxxxxxx. Exception: Unexpected internal error, please contact Databricks support."
From the event log I get this message: "Cluster terminated. Reason: Container launch failure. An unexpected error was encountered while launching containers on worker instances for the cluster. Please retry and contact Databricks if the problem persists."
The error usually occurs every 4-5 days and there are no job logs. What is also strange is that the run duration is still the same as if the notebook did run. Anyone ever run into this issue before?
03-14-2023 01:41 AM
@Eli Kvarfordt :
This error message indicates that the Spark containers failed to launch on the worker instances for the cluster, which can happen for a number of reasons, including issues with the underlying infrastructure or configuration issues. Since the message itself suggests retrying, the usual first steps are to retry the job, check the cluster configuration and instance availability in your cloud account, and review the cluster event log for earlier warnings; if the failure persists, open a ticket with Databricks support.
It's also worth noting that if the run duration is still the same as if the notebook did run, it's possible that the notebook did actually run and complete, but the logs were not saved due to the error.
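Because this kind of container launch failure is typically a transient infrastructure fault, one common mitigation is to let the scheduled job retry automatically. Below is a minimal sketch, assuming the Databricks Jobs API 2.1 (`/api/2.1/jobs/update`) and that `DATABRICKS_HOST` / `DATABRICKS_TOKEN` environment variables are set; the job ID and task key are hypothetical placeholders:

```python
import json
import os
import urllib.request

def build_retry_settings(max_retries=2, retry_interval_ms=300_000):
    """Task-level retry settings as documented for the Jobs API 2.1."""
    return {
        "max_retries": max_retries,                      # retry up to N times
        "min_retry_interval_millis": retry_interval_ms,  # wait 5 min between tries
        "retry_on_timeout": False,                       # only retry on failure, not timeout
    }

def update_job_retries(job_id, task_key):
    """Sketch: push retry settings to an existing job's task.

    Assumes DATABRICKS_HOST (e.g. https://<workspace>.cloud.databricks.com)
    and DATABRICKS_TOKEN are set in the environment.
    """
    payload = {
        "job_id": job_id,
        "new_settings": {
            "tasks": [{"task_key": task_key, **build_retry_settings()}],
        },
    }
    req = urllib.request.Request(
        f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/update",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

A retried run that succeeds on the second attempt would at least keep the pipeline healthy while the root cause is investigated.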
03-14-2023 03:32 PM
Hi @Suteja Kanuri, thanks for the response. Unfortunately I've tried these already and everything looks normal. I have many other jobs that run with different configs, and several of them use the same config this one does, but for some reason this one occasionally fails. Could the task the job is doing interfere with how the cluster behaves? If my understanding is correct, the cluster will autoscale to meet demand; maybe something about the job is causing the cluster to provision resources in a strange way? If that were the case, though, I wonder why I don't have any Databricks logs at all. It doesn't even show that the first cell ran, which just sets variables.
03-14-2023 04:35 PM
@Eli Kvarfordt :
It's certainly possible that the task your job is performing is causing issues with the cluster. For example, if the job is using up a lot of resources or generating a lot of network traffic, it could be impacting the performance of the cluster or causing it to provision resources in unexpected ways.
One thing you could try is to monitor the cluster's resource usage while the job is running, and see if there are any spikes or unusual patterns that could be related to the failure. You can use the Databricks cluster metrics dashboard to monitor the cluster's CPU, memory, and network usage in real time.
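Alongside the metrics dashboard, you can pull the cluster's event log programmatically. This is a sketch using the Clusters API (`/api/2.0/clusters/events`); the host/token environment variables are assumptions, and `failure_events` is just a hypothetical helper to narrow the list down:

```python
import json
import os
import urllib.request

def fetch_cluster_events(cluster_id, limit=50):
    """Fetch recent events for a cluster via the Clusters API.

    Assumes DATABRICKS_HOST and DATABRICKS_TOKEN are set in the environment.
    """
    req = urllib.request.Request(
        f"{os.environ['DATABRICKS_HOST']}/api/2.0/clusters/events",
        data=json.dumps({"cluster_id": cluster_id, "limit": limit}).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["events"]

def failure_events(events):
    """Keep only the event types that usually explain a launch failure."""
    interesting = {"ADD_NODES_FAILED", "TERMINATING", "INIT_SCRIPTS_FINISHED"}
    return [e for e in events if e.get("type") in interesting]
```

Scheduling a call like this right after the job's start time would capture the event sequence even for runs that die before any notebook cell executes.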
As for the lack of logs, it's possible that the failure is happening too early in the job execution process for logs to be generated. If the first cell of your notebook isn't even running, it could be that the notebook itself is failing to launch or the cluster is terminating before it even gets to the first cell. In this case, it might be helpful to try running the notebook manually outside of the job scheduler to see if you can reproduce the issue and get more information on what's happening.
03-18-2023 09:59 PM
Hi @Eli Kvarfordt
Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help.
We'd love to hear from you.
Thanks!
03-23-2023 02:46 PM
Hi @Kaniz Fatma @Vidula Khanna @Suteja Kanuri, I was not able to resolve the issue. I was monitoring it for a little while and it was behaving fine but it failed again today with the same issue. The only logs I have are from the Event log of the cluster shown below.
One thing I did notice is that the previous run, which triggers 7 hours earlier, had the same ADD_NODES_FAILED event but was eventually able to add the nodes and run the job. This makes me think there might be a race condition somewhere in the startup of the cluster? I'm not too sure how that sequence works. I also attached the first run's event logs.
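To compare the two runs, one approach is to scan each run's event log and check whether every ADD_NODES_FAILED was eventually recovered from or escalated to termination. The event type names below follow the Clusters API; treat this as a rough sketch, not an exhaustive state machine:

```python
def classify_add_nodes_failures(events):
    """Given cluster events in chronological order, report the outcome of
    each ADD_NODES_FAILED: 'recovered' if nodes were later added and the
    cluster ran, 'terminated' if the cluster shut down first.
    """
    outcomes = []
    pending = False
    for e in events:
        t = e.get("type")
        if t == "ADD_NODES_FAILED":
            pending = True
        elif pending and t in ("RUNNING", "UPSIZE_COMPLETED"):
            outcomes.append("recovered")
            pending = False
        elif pending and t == "TERMINATING":
            outcomes.append("terminated")
            pending = False
    if pending:  # log ended with the failure still unresolved
        outcomes.append("unresolved")
    return outcomes
```

Run against both event logs, this would make the "first run recovered, second run died" pattern explicit, which is useful evidence to hand to Databricks support.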
03-23-2023 02:47 PM
Here's the first run event log:
04-16-2023 06:09 AM
Hi @elikvar, did you get any solution for this error? I am also getting the same error, "failed to add 3 containers", on cluster create and cluster start. I get this error every time, and the cluster is terminated automatically.
06-13-2023 07:28 AM
Hello,
any update on this issue?
We have the same problem and no logs to investigate (not even in DBFS when we activate logging):
Unexpected failure while waiting for the cluster (<id of our cluster>) to be ready: Cluster <id of our cluster> is in unexpected state Terminated: CONTAINER_LAUNCH_FAILURE(SERVICE_FAULT): instance_id:x-xxxxxxxx,databricks_error_message:Failed to launch spark container on instance x-xxxxxxx. Exception: Unexpected internal error, please contact Databricks support
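For anyone else hitting this with no logs: it's worth double-checking that log delivery is actually configured on the cluster itself, since driver and init script logs are only shipped when `cluster_log_conf` is set. A minimal sketch of the fragment for the Clusters API create/edit payload (the DBFS destination path is a hypothetical example):

```python
# Fragment of a cluster spec for the Clusters API create/edit call.
# Logs are delivered every few minutes to <destination>/<cluster-id>/.
cluster_log_conf = {
    "cluster_log_conf": {
        "dbfs": {"destination": "dbfs:/cluster-logs/daily-job"},
    },
}
```

Even with this set, a container that never launches may produce nothing; in that case the cluster event log is usually the only record.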
14 hours ago
Cluster 'xxxxxxx' was terminated. Reason: WORKER_SETUP_FAILURE (SERVICE_FAULT). Parameters: databricks_error_message: DBFS Daemon is not reachable., gcp_error_message: Unable to reach the colocated DBFS Daemon.
Can anyone help me resolve this issue?