Databricks on GCP with GKE | Cluster stuck in starting status | GKE resource allocation failing
02-14-2025 04:01 AM
Hi Databricks Community,
I'm currently facing several challenges with my Databricks clusters running on Google Kubernetes Engine (GKE). I hope someone here might have insights or suggestions to resolve the issues.
Problem Overview:
I am experiencing frequent scaling failures and network issues in my GKE cluster, which is affecting my Databricks environment. These issues started happening recently, and I've identified multiple related problems that are hindering performance.
Key Issues:
Scaling Failures:
- I've noticed that the Cluster Autoscaler API has been throwing "no.scale.up.mig.failing.predicate" errors, meaning it is unable to scale up node pools properly. The logs indicate that the nodes don't meet the node affinity rules set for the pods, resulting in unscheduled pods and scaling failures. The error message I see is:
- "Node(s) didn't match Pod's node affinity/selector".
- The scaling failures involve multiple Managed Instance Groups (MIGs) across various zones, such as europe-west1-d, europe-west1-b, and europe-west1-c.
Frequent Kubelet Restarts:
- I'm also facing frequent kubelet restarts, which seem to be leading to instability in the cluster. This results in node disruption, further affecting the scaling process and causing intermittent downtime.
Network Issues:
- The network within my GKE cluster is not functioning properly. I'm seeing errors related to the CNI plugin (Calico) that prevent pods from communicating properly, leading to issues with scaling, pod eviction, and overall cluster stability.
Pod Disruption Budget (PDB) Conflicts:
- The current Pod Disruption Budget settings are too restrictive, causing failures when attempting to scale down or evict pods. This is likely compounded by network issues that prevent proper pod management.
Note that I can't access the GKE cluster directly with kubectl, as access is restricted (see the FAQ of this documentation). Also, I am not proficient in Kubernetes management; all of the above issues were identified with the help of ChatGPT (by passing it the error logs) in an attempt to understand what was happening on the GKE cluster.
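In case it helps others with the same restriction: the GKE cluster autoscaler visibility logs can still be queried from Cloud Logging without kubectl. Below is a minimal sketch, assuming the `google-cloud-logging` Python client, Application Default Credentials, and that the autoscaler visibility logs are enabled; the project and cluster names are placeholders. It lists recent noScaleUp events so the failing predicates per MIG are visible:

```python
# Minimal sketch: list recent GKE cluster-autoscaler "noScaleUp" events from Cloud Logging.
# Assumptions: google-cloud-logging is installed, Application Default Credentials are set,
# and PROJECT_ID / CLUSTER_NAME are placeholders for your own values.
from google.cloud import logging

PROJECT_ID = "my-gcp-project"        # placeholder
CLUSTER_NAME = "my-databricks-gke"   # placeholder

client = logging.Client(project=PROJECT_ID)

# Cluster autoscaler visibility logs; noScaleUp entries carry the
# "no.scale.up.mig.failing.predicate" reasons per MIG.
log_filter = (
    'resource.type="k8s_cluster" '
    f'resource.labels.cluster_name="{CLUSTER_NAME}" '
    f'logName="projects/{PROJECT_ID}/logs/'
    'container.googleapis.com%2Fcluster-autoscaler-visibility" '
    "jsonPayload.noDecisionStatus.noScaleUp:*"
)

for entry in client.list_entries(
    filter_=log_filter, order_by=logging.DESCENDING, max_results=20
):
    no_scale_up = entry.payload.get("noDecisionStatus", {}).get("noScaleUp", {})
    print(entry.timestamp, no_scale_up)
```

Each noScaleUp entry should list the rejected MIGs and the reason string (for example "no.scale.up.mig.failing.predicate"), which helps narrow down which node pool and affinity rule is at fault.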
Thank you,
Edouard
02-15-2025 04:49 PM
Hello @edouardtouze, could you please share the cluster ID and the timestamp when the issue was observed?
02-17-2025 04:07 AM
Hello @Alberto_Umana,
Thank you for your reply, here is the information you asked for:
Cluster ID (name): db-3451421062342009-3-0414-082538-944
Timestamp of the first observed issue: 2025-02-12T18:33:31Z
It seems the error occurred before, but not at the same scale or frequency; see the image below:
Thank you
02-18-2025 06:25 AM
I am having similar issues. This is the first time I am using the `databricks_cluster` resource; my `terraform apply` does not complete gracefully, and I see numerous errors about:
1. Can't scale up a node pool because of a failing scheduling predicate.
The autoscaler was waiting for an ephemeral volume controller to create a PersistentVolumeClaim (PVC) before it could schedule the pod.
This is happening on an executor pod.
2. Pod is blocking scale down because it doesn't have enough Pod Disruption Budget (PDB), although this is a more minor issue.
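In case it is useful for diagnosis: the Databricks cluster event log is still reachable without kubectl access to the managed GKE cluster. A minimal sketch, assuming the Python `databricks-sdk` package and workspace credentials in the environment; the cluster ID below is a placeholder:

```python
# Minimal sketch: print recent Databricks cluster events to see why provisioning stalls.
# Assumptions: databricks-sdk is installed and DATABRICKS_HOST / DATABRICKS_TOKEN are set;
# the cluster ID below is a placeholder for your own cluster.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

CLUSTER_ID = "0414-082538-944abcd"  # placeholder

# events() pages through the cluster event log (RESIZING, NODES_LOST, etc.)
for event in w.clusters.events(cluster_id=CLUSTER_ID, limit=50):
    print(event.timestamp, event.type, event.details)
```

The event details often include the underlying GCP error when node acquisition fails, which can be more actionable than the generic "stuck in starting" state.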
Thanks in advance.

