Administration & Architecture

Databricks on GCP with GKE | Cluster stuck in starting status | GKE resource allocation failing

edouardtouze
New Contributor II

Hi Databricks Community,

I'm currently facing several challenges with my Databricks clusters running on Google Kubernetes Engine (GKE). I hope someone here might have insights or suggestions to resolve the issues.

Problem Overview:

I am experiencing frequent scaling failures and network issues in my GKE cluster, which is affecting my Databricks environment. These issues started happening recently, and I've identified multiple related problems that are hindering performance.

Key Issues:

  1. Scaling Failures:

    • I've noticed that the Cluster Autoscaler API has been throwing "no.scale.up.mig.failing.predicate" errors, meaning it is unable to scale up node pools properly. The logs indicate that nodes don't meet the node affinity rules set for the pods, resulting in unscheduled pods and scaling failures. The error message I see is:
      • "Node(s) didn't match Pod's node affinity/selector".
    • The scaling failures involve multiple Managed Instance Groups (MIGs) across various zones, such as europe-west1-d, europe-west1-b, and europe-west1-c.
  2. Frequent Kubelet Restarts:

    • I'm also facing frequent kubelet restarts, which seem to be causing instability in the cluster. This results in node disruption, further affecting the scaling process and causing intermittent downtime.
  3. Network Issues:

    • The network within my GKE cluster is not functioning properly. I'm seeing errors related to the CNI plugin (Calico), which prevent pods from communicating, leading to issues with scaling, pod eviction, and overall cluster stability.
  4. Pod Disruption Budget (PDB) Conflicts:

    • The current Pod Disruption Budget settings are too restrictive, causing failures when attempting to scale down or evict pods. This is likely compounded by network issues that prevent proper pod management.

Note that I can't access the GKE cluster directly with kubectl, as access is restricted (see the FAQ of this documentation). Also, I am not proficient in Kubernetes management; all of the above issues were identified with the help of ChatGPT (by passing it the error logs) in an attempt to understand what was happening on the GKE cluster.
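
Since kubectl is restricted on the Databricks-managed GKE cluster, the autoscaler's noScaleUp decisions should still be readable from Cloud Logging. Below is a rough sketch of such a query with the google-cloud-logging Python client; the project ID is a placeholder, and the log filter is my assumption based on GKE's cluster autoscaler visibility logs:

```python
from google.cloud import logging as gcl

# Placeholder: the GCP project that hosts the Databricks GKE cluster.
client = gcl.Client(project="my-databricks-gcp-project")

# Assumed filter: GKE's cluster autoscaler "visibility" logs carry the
# noScaleUp reasons such as "no.scale.up.mig.failing.predicate".
log_filter = (
    'resource.type="k8s_cluster" '
    'logName:"cluster-autoscaler-visibility" '
    'timestamp>="2025-02-12T00:00:00Z"'
)

for entry in client.list_entries(
    filter_=log_filter, order_by=gcl.DESCENDING, max_results=20
):
    print(entry.timestamp, entry.payload)
```

The noScaleUp entries normally name the MIG involved and the failing predicate (e.g. the node affinity/selector mismatch mentioned above), which is easier to read than the raw kubelet logs.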

Thank you,
Edouard 

3 REPLIES

Alberto_Umana
Databricks Employee

Hello @edouardtouze, could you please share the cluster ID and the timestamp when the issue was observed?

Hello @Alberto_Umana,

Thank you for your reply. Here is the information you asked for:

Cluster ID (name): db-3451421062342009-3-0414-082538-944
Timestamp of first observed issue: 2025-02-12T18:33:31Z


It seems that the error occurred before, but not at the same scale or frequency; see the image below:

 

[Attached screenshot: edouardtouze_0-1739793831720.png]
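
In case it helps with correlation, a query along these lines should list the Databricks cluster's own event log around that timestamp (a sketch using the Databricks Python SDK; the cluster ID below is a placeholder for the Databricks cluster ID from the cluster URL, not the GKE instance-group name, and authentication is assumed to come from the environment):

```python
from datetime import datetime, timezone
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # auth assumed from env vars or ~/.databrickscfg

# Window around the first observed failure (2025-02-12T18:33:31Z), in epoch ms.
start = int(datetime(2025, 2, 12, 18, 0, tzinfo=timezone.utc).timestamp() * 1000)
end = int(datetime(2025, 2, 12, 20, 0, tzinfo=timezone.utc).timestamp() * 1000)

# "<databricks-cluster-id>" is a placeholder for the ID shown in the cluster URL.
for e in w.clusters.events(
    cluster_id="<databricks-cluster-id>", start_time=start, end_time=end, limit=50
):
    print(e.timestamp, e.type, e.details)
```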


Thank you

chalkboardbrad
New Contributor II

I am having similar issues. This is the first time I am using the `databricks_cluster` resource; my terraform apply does not complete gracefully, and I see numerous errors about:

1. Can't scale up a node pool because of a failing scheduling predicate

The autoscaler was waiting for an ephemeral volume controller to create a PersistentVolumeClaim (PVC) before it could schedule the pod.

This is happening on an executor pod.

2. A pod is blocking scale-down because it doesn't have enough Pod Disruption Budget (PDB)

This is a more minor issue, though.
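
In case it is useful for comparing notes, here is a minimal sketch (Databricks Python SDK, placeholder cluster ID) of checking why the cluster created by `databricks_cluster` never reached RUNNING after the failed apply; the state message and termination reason often point at the GKE-side provisioning problem:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # auth assumed from env vars or ~/.databrickscfg

# Placeholder: the cluster ID recorded in terraform state for the databricks_cluster resource.
c = w.clusters.get(cluster_id="<databricks-cluster-id>")

print(c.state)          # e.g. PENDING or TERMINATED
print(c.state_message)  # human-readable reason the cluster is stuck or failed

if c.termination_reason:
    print(c.termination_reason.code, c.termination_reason.parameters)
```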

Thanks in advance.
