NandiniN
Databricks Employee
Databricks Employee

The error message "BOOTSTRAP_TIMEOUT (SERVICE_FAULT)" indicates that the cluster was terminated because it took too long to initialize. This can happen due to various reasons, including network connectivity issues between the data plane and the control plane, or issues with the cloud provider's infrastructure.

Given the intermittent nature of the issue (1 in 10-20 runs), it might be challenging to pinpoint the exact cause. Monitoring the infrastructure and keeping track of when the errors occur can help identify any patterns or recurring issues.

I would suggest debugging it right when the issue is seen and checking all the logs and also check from the cloud provider.