Hi @rpaschenko, the failures you saw on September 28-29 could have more than one cause, so I'll take the two symptoms separately.
The jobs that could not start for some time and then failed with an error appear to have hit a timeout while the instance was initializing. This could be caused by network connectivity problems between the data plane and the control plane, or the instance may have been terminated unexpectedly. It's worth checking both the network connectivity and the instance's status.
The jobs that failed with the event "Failed to add 16 containers to the compute" most likely hit a container launch failure. This can happen for several reasons, such as resource constraints, network issues, or problems with the container image.
To troubleshoot these issues, you might want to:
- Review the logs for more details about the errors.
- Check the network connectivity between the data plane and the control plane.
- Check the status of the instances and containers.
- Review the job configurations and the resources allocated to the jobs.
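
For the connectivity check in the list above, here is a minimal Python sketch you could run from the data plane side (for example, in a notebook on a cluster in the same network). Note that `can_connect` is a hypothetical helper I made up for illustration, not part of any platform API, and the host and port are placeholders you would need to replace with your own control plane endpoint. It only tests whether a TCP connection can be opened, nothing more.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration only. DNS failures, refused
    connections, and timeouts are all reported as False.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # socket.gaierror and socket.timeout are both subclasses of OSError.
        return False


# Example usage (placeholder host, replace with your own endpoint):
# print(can_connect("<your-control-plane-host>", 443))
```

A `True` result only shows that the port was reachable at that moment. Since your failures were intermittent, it may be more useful to run the check repeatedly over time and compare the results with the timestamps of the failed jobs.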