05-12-2023 01:25 AM
I started a free trial with Databricks and everything was running perfectly. The trial ended on the 28th of April, and I assume I was simply moved onto the normal premium paid plan. I last used my general-purpose cluster on the 2nd of May. Since coming back from a week's holiday, I have been unable to restart my general compute cluster. I tried deleting this cluster and creating a new one, but it has been stuck in the state "Finding instances for new nodes, acquiring new instances if necessary" for nearly two hours.
I currently have less than the required minimum GCP quota for N2_CPUS: I have 24, and I have requested an increase to 50 multiple times but have been rejected. I assume I need to prove I am hitting my current quota limit before GCP will allow an increase. I guessed this might be the issue, so I reduced my new cluster to a maximum of 4 workers, but I am still unable to create a cluster (see the attached image of the cluster configuration). Despite being stuck in this creating state for some hours, I see no error on the Databricks side and no clue as to why this is happening. If I look at the event log, all I see is CREATING, as shown in the attached image.
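In case it is useful, this is roughly how I am checking whether N2_CPUS is actually the bottleneck. It is only a minimal sketch using the google-cloud-compute client; the project ID and region are placeholders for my own values.

```python
# Minimal sketch: check N2_CPUS usage vs. limit in the workspace region.
# Assumes `pip install google-cloud-compute` and application-default credentials.
# PROJECT_ID and REGION are placeholders for the real workspace project/region.
from google.cloud import compute_v1

PROJECT_ID = "my-gcp-project"   # placeholder
REGION = "europe-west2"         # placeholder

region_info = compute_v1.RegionsClient().get(project=PROJECT_ID, region=REGION)
for quota in region_info.quotas:
    if quota.metric == "N2_CPUS":
        print(f"N2_CPUS: usage={quota.usage} limit={quota.limit}")
```

In my case the usage is nowhere near the limit of 24, which is why I doubted quota was the whole story.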
Nothing has changed on my side since the last time I used Databricks. This is a huge pain, since I was planning on ramping up my activity on Databricks and we are now losing a great deal of productivity.
Any help would be greatly appreciated.
05-12-2023 03:44 AM
Just following on from my original question: it seems that CPU activity and GKE activity stopped on Sunday evening, the 7th of May. If I look at the logs in GCP, I see there is a delete cluster action, "google.container.v1.ClusterManager.DeleteCluster". This action did not come from me... is it possible that GCP deleted my cluster? If so, how is this possible, and what actions can I take to get Databricks running again?
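For anyone following along, this is roughly how I am pulling those entries out of Cloud Audit Logs. It is a rough sketch with the google-cloud-logging client; the project ID is a placeholder and the field lookups assume the standard audit-log JSON layout.

```python
# Sketch: list recent GKE DeleteCluster audit-log entries and who issued them.
# Assumes `pip install google-cloud-logging` and suitable credentials.
# PROJECT_ID is a placeholder; field names follow the audit-log JSON layout.
from google.cloud import logging_v2

client = logging_v2.Client(project="my-gcp-project")  # placeholder project
log_filter = 'protoPayload.methodName="google.container.v1.ClusterManager.DeleteCluster"'

for entry in client.list_entries(filter_=log_filter, order_by="timestamp desc", max_results=5):
    proto = entry.to_api_repr().get("protoPayload", {})
    who = proto.get("authenticationInfo", {}).get("principalEmail", "unknown")
    print(entry.timestamp, who, proto.get("resourceName"))
```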
05-12-2023 05:31 AM
Looking closer at my logs, the deletion of the GKE cluster was made by the Databricks service account, which means the action must have come from Databricks. My Databricks workspace has not been deleted. Does anyone know how this could have happened?
05-12-2023 05:51 AM
OK, after hours of digging I can see what has happened.
From the docs here: Databricks automatically deletes the workspace's GKE cluster after a period of inactivity and recreates it the next time a cluster is launched.
The issue is that now, when I try to create a new cluster in Databricks, Databricks seems unable to recreate the GKE cluster. Surely it should not take more than an hour to recreate the GKE cluster...?
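To check whether Databricks has actually started recreating the GKE cluster on the GCP side, I am using something like the sketch below (google-cloud-container client; the project ID is a placeholder). If the Databricks-managed cluster shows up with status PROVISIONING, the recreation is at least in progress.

```python
# Sketch: list GKE clusters in the project and show their status,
# to check whether the Databricks-managed cluster is being recreated.
# Assumes `pip install google-cloud-container`; PROJECT_ID is a placeholder.
from google.cloud import container_v1

PROJECT_ID = "my-gcp-project"  # placeholder
client = container_v1.ClusterManagerClient()
response = client.list_clusters(parent=f"projects/{PROJECT_ID}/locations/-")
for cluster in response.clusters:
    print(cluster.name, cluster.location, cluster.status.name)
```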
05-12-2023 06:17 AM
When the Databricks service account makes the request to create the GKE cluster, I see multiple "Internal Error" messages in Logs Explorer.
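This is roughly the query I am using to surface those failed CreateCluster calls; same assumptions as the DeleteCluster sketch above, with the project ID as a placeholder.

```python
# Sketch: pull recent failed GKE CreateCluster attempts and their error messages.
# Assumes `pip install google-cloud-logging`; PROJECT_ID is a placeholder.
from google.cloud import logging_v2

client = logging_v2.Client(project="my-gcp-project")  # placeholder
log_filter = (
    'protoPayload.methodName="google.container.v1.ClusterManager.CreateCluster" '
    "AND severity>=ERROR"
)
for entry in client.list_entries(filter_=log_filter, order_by="timestamp desc", max_results=5):
    proto = entry.to_api_repr().get("protoPayload", {})
    status = proto.get("status", {})
    print(entry.timestamp, status.get("code"), status.get("message"))
```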
05-12-2023 06:46 AM
@Oliver Paul Every time you create a cluster, Databricks on GCP uses GKE and containers to spin up your clusters; it is a bit different from what we have on Azure and AWS.
If you are facing quota issues, that has nothing to do with GKE. Please check your GCP plan, request a quota increase, and confirm with the GCP team whether there is a quota issue.
If they say there is a limit on the requested instance type, try creating your Databricks workspace in another region and see if you still face the same issue.
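If you do end up comparing regions, a rough sketch like the one below can show the N2_CPUS headroom in a few candidate regions before you move anything; the project ID and region list are just placeholders.

```python
# Sketch: compare N2_CPUS quota headroom across a few candidate regions
# before moving the workspace. PROJECT_ID and the region list are placeholders.
from google.cloud import compute_v1

PROJECT_ID = "my-gcp-project"                                         # placeholder
CANDIDATE_REGIONS = ["europe-west1", "europe-west2", "us-central1"]  # example regions

regions = compute_v1.RegionsClient()
for region in CANDIDATE_REGIONS:
    info = regions.get(project=PROJECT_ID, region=region)
    for quota in info.quotas:
        if quota.metric == "N2_CPUS":
            print(f"{region}: usage={quota.usage} limit={quota.limit}")
```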
05-12-2023 06:53 AM
Thanks @karthik p. What does not make sense to me is that I was able to create the GKE cluster with no issues when I started my workspace a few weeks ago, yet now I get internal errors on the GCP side when Databricks tries to recreate it after the period of inactivity.
05-12-2023 10:16 AM
@Oliver Paul Did you get a chance to check the logs at the GCP level by going to the GCE instance that gets created when you start a cluster? You can find it by the cluster ID under the tags on the Databricks cluster, or you can see the IP address of the cluster in the Spark UI.
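Something along these lines can help locate the GCE instances backing a Databricks cluster so you can open their logs. This is only a sketch: it assumes the Databricks cluster ID shows up in the instance labels (check the tags shown on the cluster page in Databricks for the exact key and value), and the project ID and cluster ID are placeholders.

```python
# Sketch: find the GCE instances backing a Databricks cluster by scanning
# instance labels for the cluster ID. Assumes `pip install google-cloud-compute`.
# PROJECT_ID, DATABRICKS_CLUSTER_ID, and the exact label key Databricks uses
# are assumptions; check the tags on the cluster page for the real values.
from google.cloud import compute_v1

PROJECT_ID = "my-gcp-project"                    # placeholder
DATABRICKS_CLUSTER_ID = "0512-081234-abcd1234"   # placeholder cluster ID

instances = compute_v1.InstancesClient()
for zone, scoped in instances.aggregated_list(project=PROJECT_ID):
    for instance in scoped.instances:
        if DATABRICKS_CLUSTER_ID in str(dict(instance.labels)):
            print(zone, instance.name, instance.status)
```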