cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Databricks Free Trial Help
Engage in discussions about the Databricks Free Trial within the Databricks Community. Share insights, tips, and best practices for getting started, troubleshooting issues, and maximizing the value of your trial experience to explore Databricks' capabilities effectively.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Run failed with error message Unexpected failure while waiting for the cluster

vishalv4476
New Contributor III

Hi All, 

Recently all of my databricks jobs are failing due to error:

Run failed with error message Unexpected failure while waiting for the cluster (xxxx-xxxxxx-xxxxxxxx) to be ready: Cluster '(xxxx-xxxxxx-xxxxxxxx) is unhealthy.

it was all running till 21st july but started failing on 22nd july both in dev and prod env

1 ACCEPTED SOLUTION

Accepted Solutions

Hi @vishalv4476 ,

Could you try adding the -y flag to ensure non-interactive installation: sudo apt -y install jq || echo 'Warning: Failed to install jq'

View solution in original post

5 REPLIES 5

Advika
Databricks Employee
Databricks Employee

Hello @vishalv4476!

Could you please check the Event Log details for the failed clusters? They can help identify if the issue is related to the driver, network, libraries, or any init scripts. It's also possible the problem is due to changes made between these two dates.

vishalv4476
New Contributor III

Hi,

Issue potentially is with init scripts , it starts and run for 50 mins and terminates the clusters to execute the jobs

Global init script we are using is :

#!/bin/bash

# To diable immediate exit or errors
set +e

curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash || echo "Warning: Azure CLI installation failed."

sudo apt install jq || echo "Warning: Failed to install jq."

export resource_id=$(curl -s -H Metadata:true --noproxy "*" "some-api-xxx-xxx" | jq -r .compute.resourceId) || echo "Warning: Failed to fetch resource ID or resource ID is empty."

export msi_id="xxx-xxx-xxx-xxx"
export dcr_id="/subscriptions/xxxx-xxxx-xxxx-xxxx/resourceGroups/xx-xx-xx-xx/providers/Microsoft.Insights/dataCollectionRules/xx-xx-xx-xx"

az login --identity --username ${msi_id} || echo "Warning: Azure CLI login failed."
echo "Installing AMA for VM-${resource_id}"
az vm extension set --name AzureMonitorLinuxAgent --publisher Microsoft.Azure.Monitor --ids $resource_id --enable-auto-upgrade true || echo "Warning: Failed to install Azure Monitor Agent."
echo "Enabling DCRA for VM-${resource_id}"
az monitor data-collection rule association create --name "VM-4" --rule-id $dcr_id --resource $resource_id || echo "Warning: Failed to enable DCRA for VM-${resource_id}."

echo "Global init script completed."
exit 0

vishalv4476
New Contributor III

Event logs from job run:

vishalv4476_0-1753271123242.png

 

Hi @vishalv4476 ,

Could you try adding the -y flag to ensure non-interactive installation: sudo apt -y install jq || echo 'Warning: Failed to install jq'

vishalv4476
New Contributor III

My bad , just a silly thing to miss out. Really appreciate your help @SP_6721 

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local communityโ€”sign up today to get started!

Sign Up Now