
Trying to understand why a cluster reports as "terminating" right after being created

mrstevegross
Contributor III

We use a "warmup" mechanism to get our DBR instance pool into a state where it has at least N instances. The logic is:

  1. For N repetitions:
    1. Request a new DBR cluster in the pool (which causes the pool to request an AWS instance)
    2. Wait for the cluster to report as RUNNING
      1. If it reports as TERMINATING, abandon this iteration
    3. Terminate the DBR cluster (to free it up for an upcoming real request)
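
For readers who want to see the loop concretely, here is a rough Python sketch of the steps above. It is not our actual code. `ClusterClient` and its three methods are hypothetical stand-ins for whatever wraps the Databricks Clusters API in your setup, and the timeout plus treating `TERMINATED` like `TERMINATING` are assumptions added for safety.

```python
import time
from typing import Callable, List, Protocol


class ClusterClient(Protocol):
    """Hypothetical minimal interface; not a real Databricks SDK class."""

    def create_cluster(self, pool_id: str) -> str: ...
    def get_state(self, cluster_id: str) -> str: ...
    def terminate_cluster(self, cluster_id: str) -> None: ...


def warm_pool(
    client: ClusterClient,
    pool_id: str,
    n: int,
    poll_seconds: float = 10.0,
    timeout_seconds: float = 1200.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run n warmup iterations; return the IDs of abandoned clusters."""
    abandoned: List[str] = []
    for _ in range(n):
        # Step 1: requesting a cluster makes the pool request an instance.
        cluster_id = client.create_cluster(pool_id)
        deadline = time.monotonic() + timeout_seconds
        while True:
            # Step 2: poll until the cluster reports RUNNING.
            state = client.get_state(cluster_id)
            if state == "RUNNING":
                # Step 3: terminate so the instance returns to the pool.
                client.terminate_cluster(cluster_id)
                break
            # Step 2.1: abandon this iteration. TERMINATED and the timeout
            # are assumptions; the original logic only checks TERMINATING.
            if state in ("TERMINATING", "TERMINATED") or time.monotonic() >= deadline:
                abandoned.append(cluster_id)
                break
            sleep(poll_seconds)
    return abandoned
```

Returning the abandoned cluster IDs makes the failure mode visible: if the list has the same length as `n`, every iteration failed, and those IDs are the ones worth inspecting.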

Normally, this works fine. Lately, however, there's a weird issue: we hit the 1.2.1 situation ("If it reports as TERMINATING, abandon this iteration") for *all* clusters. I have occasionally seen this for 1 cluster (of, say, 40), but never for ALL of them.

What could cause DBR to transition a cluster to "TERMINATING" right after it's created?

1 ACCEPTED SOLUTION

Accepted Solutions

mrstevegross
Contributor III

Aha, found it. I monitored the pool status via the DBR UI, and when a cluster *started* being provisioned, I clicked into it. Then I looked at the event log, and found useful information about failed steps. The underlying error was indeed AWS related (an issue in our role configuration).
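
In case it helps anyone automate the same check instead of clicking through the UI: as far as I understand the Clusters API, the `clusters/get` response carries a `termination_reason` object with `code`, `type`, and `parameters` fields. The helper below is a sketch under that assumption; `summarize_termination` is a made-up name, so verify the field names against the current API docs before relying on it.

```python
from typing import Any, Mapping


def summarize_termination(cluster_info: Mapping[str, Any]) -> str:
    """Build a one-line summary from a clusters/get style response dict.

    Assumes the 'state' and 'termination_reason' fields described above;
    missing fields fall back to 'UNKNOWN' rather than raising.
    """
    state = cluster_info.get("state", "UNKNOWN")
    reason = cluster_info.get("termination_reason") or {}
    code = reason.get("code", "UNKNOWN")
    reason_type = reason.get("type", "UNKNOWN")
    params = reason.get("parameters") or {}
    # Sort the parameters so the output is stable across calls.
    detail = "; ".join(f"{key}={value}" for key, value in sorted(params.items()))
    summary = f"state={state} code={code} type={reason_type}"
    return f"{summary} ({detail})" if detail else summary
```

Logging this summary at the point where the warmup loop abandons an iteration would have surfaced the AWS role problem without needing to catch a cluster mid-provisioning in the UI.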


2 REPLIES

mrstevegross
Contributor III

I see that there is some documentation on the subject; I'm exploring whether AWS is actually the culprit.

mrstevegross
Contributor III

Aha, found it. I monitored the pool status via the DBR UI, and when a cluster *started* being provisioned, I clicked into it. Then I looked at the event log, and found useful information about failed steps. The underlying error was indeed AWS related (an issue in our role configuration).