Administration & Architecture

Databricks on GCP with GKE | Cluster stuck in starting status | GKE resource allocation failing

edouardtouze
New Contributor II

Hi Databricks Community,

I'm currently facing several challenges with my Databricks clusters running on Google Kubernetes Engine (GKE). I hope someone here might have insights or suggestions to resolve the issues.

Problem Overview:

I am experiencing frequent scaling failures and network issues in my GKE cluster, which is affecting my Databricks environment. These issues started happening recently, and I've identified multiple related problems that are hindering performance.

Key Issues:

  1. Scaling Failures:

    • I've noticed that the Cluster Autoscaler API has been throwing "no.scale.up.mig.failing.predicate" errors, meaning it is unable to scale up node pools properly. The logs indicate that nodes don't meet the node affinity rules set for the pods, resulting in unscheduled pods and scaling failures. The error message I see is:
      • "Node(s) didn't match Pod's node affinity/selector".
    • The scaling failures involve multiple Managed Instance Groups (MIGs) across various zones, such as europe-west1-d, europe-west1-b, and europe-west1-c.
  2. Frequent Kubelet Restarts:

    • I'm also facing frequent kubelet restarts, which seem to be causing instability in the cluster. This results in node disruption, further affecting the scaling process and causing intermittent downtime.
  3. Network Issues:

    • The network within my GKE cluster is not functioning properly. I'm seeing errors related to the CNI plugin (Calico), which prevent pods from communicating, leading to issues with scaling, pod eviction, and overall cluster stability.
  4. Pod Disruption Budget (PDB) Conflicts:

    • The current Pod Disruption Budget settings are too restrictive, causing failures when attempting to scale down or evict pods. This is likely compounded by network issues that prevent proper pod management.

Note that I can't access the GKE cluster directly with kubectl, as access is restricted (see the FAQ of this documentation). Also, I am not proficient in Kubernetes management; all of the above issues were identified with the help of ChatGPT (by passing it the error logs) in an attempt to understand what was happening on the GKE cluster.
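
Since kubectl is restricted on the Databricks-managed GKE cluster, the autoscaler's noScaleUp decisions should still be readable from Cloud Logging. Below is a rough sketch of such a query with the google-cloud-logging Python client; the project ID is a placeholder, and the log filter is my assumption based on GKE's cluster autoscaler visibility logs:

```python
from google.cloud import logging as gcl

# Placeholder: the GCP project that hosts the Databricks GKE cluster.
client = gcl.Client(project="my-databricks-gcp-project")

# Assumed filter: GKE's cluster autoscaler "visibility" logs carry the
# noScaleUp reasons such as "no.scale.up.mig.failing.predicate".
log_filter = (
    'resource.type="k8s_cluster" '
    'logName:"cluster-autoscaler-visibility" '
    'timestamp>="2025-02-12T00:00:00Z"'
)

for entry in client.list_entries(
    filter_=log_filter, order_by=gcl.DESCENDING, max_results=20
):
    print(entry.timestamp, entry.payload)
```

The noScaleUp entries normally name the MIG involved and the failing predicate (e.g. the node affinity/selector mismatch mentioned above), which is easier to read than the raw kubelet logs.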

Thank you,
Edouard 

3 REPLIES

Alberto_Umana
Databricks Employee

Hello @edouardtouze, could you please share the cluster ID and the timestamp when the issue was observed?

Hello @Alberto_Umana,

Thank you for your reply. Here is the information you asked for:

Cluster ID (name): db-3451421062342009-3-0414-082538-944
Timestamp of first observed issue: 2025-02-12T18:33:31Z


It seems that the error occurred before, but not at the same scale or frequency; see the image below:

 

[Attached screenshot: edouardtouze_0-1739793831720.png]
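
In case it helps with correlation, a query along these lines should list the Databricks cluster's own event log around that timestamp (a sketch using the Databricks Python SDK; the cluster ID below is a placeholder for the Databricks cluster ID from the cluster URL, not the GKE instance-group name, and authentication is assumed to come from the environment):

```python
from datetime import datetime, timezone
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # auth assumed from env vars or ~/.databrickscfg

# Window around the first observed failure (2025-02-12T18:33:31Z), in epoch ms.
start = int(datetime(2025, 2, 12, 18, 0, tzinfo=timezone.utc).timestamp() * 1000)
end = int(datetime(2025, 2, 12, 20, 0, tzinfo=timezone.utc).timestamp() * 1000)

# "<databricks-cluster-id>" is a placeholder for the ID shown in the cluster URL.
for e in w.clusters.events(
    cluster_id="<databricks-cluster-id>", start_time=start, end_time=end, limit=50
):
    print(e.timestamp, e.type, e.details)
```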


Thank you

chalkboardbrad
New Contributor II

I am having similar issues. This is the first time I am using the `databricks_cluster` resource; my terraform apply does not complete gracefully, and I see numerous errors about:

1. Can't scale up a node pool because of a failing scheduling predicate

The autoscaler was waiting for an ephemeral volume controller to create a PersistentVolumeClaim (PVC) before it could schedule the pod.

This is happening on an executor pod.

2. A pod is blocking scale-down because it doesn't have enough Pod Disruption Budget (PDB)

This is a more minor issue, though.
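
In case it is useful for comparing notes, here is a minimal sketch (Databricks Python SDK, placeholder cluster ID) of checking why the cluster created by `databricks_cluster` never reached RUNNING after the failed apply; the state message and termination reason often point at the GKE-side provisioning problem:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # auth assumed from env vars or ~/.databrickscfg

# Placeholder: the cluster ID recorded in terraform state for the databricks_cluster resource.
c = w.clusters.get(cluster_id="<databricks-cluster-id>")

print(c.state)          # e.g. PENDING or TERMINATED
print(c.state_message)  # human-readable reason the cluster is stuck or failed

if c.termination_reason:
    print(c.termination_reason.code, c.termination_reason.parameters)
```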

Thanks in advance.
