Cannot restart or create a new cluster on Databricks using Google Cloud Platform

96286
Contributor

I started a free trial with Databricks and everything was running perfectly. The trial ended on 28th April, and I assume I was simply moved onto the normal premium paid plan. I last used my general cluster on 2nd May. Since coming back from a week's holiday, I am unable to restart my general compute cluster. I tried deleting this cluster and creating a new one, but I am stuck in the state "Finding instances for new nodes, acquiring new instances if necessary", and have been for nearly 2 hours.

I am currently below the required minimum quota on GCP for N2_CPUS: I have 24, and I have requested an increase to 50 multiple times but have been rejected each time. I assume I need to prove I am hitting my current quota limit before GCP will grant an increase. I guessed this might be the issue, so I reduced the number of workers in my new cluster to a maximum of 4, but I am still unable to create a cluster (see the image of the cluster configuration). Despite being stuck in this "creating" state for several hours, I see no error on the Databricks side and no clues as to why this is happening. If I look at the event log, all I see is CREATING, as shown in the attached image.
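For anyone debugging a similar situation: you can compare your current N2 CPU usage against the quota limit directly with the gcloud CLI. This is a minimal sketch; `us-central1` is an example region, so substitute the region your Databricks workspace actually runs in.

```shell
# Show the N2_CPUS quota (limit vs. current usage) for a region.
# The JSON quota entries look like: {"limit": 24.0, "metric": "N2_CPUS", "usage": 8.0}
gcloud compute regions describe us-central1 --format=json \
  | grep -B1 -A1 '"N2_CPUS"'
```

If `usage` is well below `limit`, the stuck cluster creation is unlikely to be a simple quota problem.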

Cluster config 

event log 

Nothing from my perspective has changed since the last time I used Databricks. This is a huge pain, since I was planning to ramp up my activity on Databricks, and we are now losing a great deal of productivity.

Any help would be greatly appreciated.

7 REPLIES

96286
Contributor

Just following on from my original question: it seems that CPU activity and GKE activity stopped on Sunday evening, 7th May. If I look at the logs in GCP, I see there is a delete-cluster action: "google.container.v1.ClusterManager.DeleteCluster". This action did not come from me. Is it possible that GCP deleted my cluster? If so, how is this possible, and what actions can I take to get Databricks running again?
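If you need to find who or what issued a DeleteCluster call, the GKE audit logs can be queried from the CLI as well as from Logs Explorer. A hedged sketch (run against the project that hosts your Databricks workspace):

```shell
# List recent GKE DeleteCluster audit-log entries, showing when they
# happened and which principal (user or service account) made the call.
gcloud logging read \
  'protoPayload.methodName="google.container.v1.ClusterManager.DeleteCluster"' \
  --limit=5 \
  --format="table(timestamp, protoPayload.authenticationInfo.principalEmail)"
```

The `principalEmail` column is what reveals whether the deletion came from your own account or from the Databricks-managed service account.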

96286
Contributor

Looking closer at my logs, the deletion of the GKE cluster was made by the Databricks service account, which means the action must have come from Databricks. My Databricks workspace has not been deleted. Does anyone know how this could have happened?

96286
Contributor

Ok. After hours of digging I see what has happened.

From the docs here:

  • The GKE cluster cost applies even if Databricks clusters are idle. To reduce this idle-time cost, Databricks deletes the GKE cluster in your account if no Databricks Runtime clusters are active for five days. Other resources, such as VPC and GCS buckets, remain unchanged. The next time a Databricks Runtime cluster starts, Databricks recreates the GKE cluster, which adds to the initial Databricks Runtime cluster launch time. For an example of how GKE cluster deletion reduces monthly costs, let’s say you used a Databricks Runtime cluster on the first of the month but not again for the rest of the month: your GKE usage would be the five days before the idle timeout takes effect and nothing more, costing approximately $33 for the month.

The issue is that, now that I want to create a new cluster in Databricks, Databricks seems unable to recreate the GKE cluster. Surely it should not take more than an hour to recreate the GKE cluster?

96286
Contributor

When the Databricks service account makes the request to create the GKE cluster, I see multiple "Internal Error" messages in Logs Explorer.
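To pull out the detail behind those "Internal Error" messages, you can filter the audit logs for failed CreateCluster calls. A minimal sketch, assuming the errors surface as audit-log entries with ERROR severity in the workspace's GCP project:

```shell
# Show recent failed GKE CreateCluster attempts with their status payloads,
# which usually contain the underlying error code and message.
gcloud logging read \
  'protoPayload.methodName="google.container.v1.ClusterManager.CreateCluster" severity>=ERROR' \
  --limit=10 \
  --format="yaml(timestamp, protoPayload.status)"
```

The `status` message (e.g. a quota, permission, or internal-service error) is what Databricks support or GCP support will want to see.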

karthik_p
Esteemed Contributor

@Oliver Paul​ Every time you create a cluster in Databricks on GCP, Databricks uses GKE and containers to spin up your clusters. This is a bit different from how it works on Azure and AWS.

If you are facing quota issues, they have nothing to do with GKE. Please check your GCP plan, request a quota increase, and confirm the quota issue with the GCP team.

If they say there is a limit on the requested instance type, try creating your Databricks workspace in another region and see whether you still face the same issue.

Thanks @karthik p​ . What does not make any sense to me is that I was already able to create GKE with no issues when I started my workspace a new weeks ago. Yet now I get internal errors on the GCP side when databricks tries to recreate it after inactivity.

karthik_p
Esteemed Contributor

@Oliver Paul​ Did you get a chance to check the logs at the GCP level by going to the GCP instance that got created when you started the cluster? You can find it by the cluster ID under tags in the Databricks cluster, or you can see the IP address of the cluster in the Spark UI.
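To locate those instances from the CLI, you can filter the instance list by the Databricks cluster ID. A hedged sketch: the label key `cluster_id` and the ID value below are illustrative assumptions; check the actual tag names shown on your cluster in the Databricks UI.

```shell
# List GCE instances carrying a (hypothetical) Databricks cluster-ID label,
# with their zone, status, and internal IP for cross-checking against Spark UI.
gcloud compute instances list \
  --filter="labels.cluster_id=0507-123456-abcd123" \
  --format="table(name, zone, status, networkInterfaces[0].networkIP)"
```

If the filter returns nothing, drop it and match on the internal IP you see in the Spark UI instead.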
