<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Databricks on GCP with GKE | Cluster stuck in starting status | GKE allocation resource failing in Administration &amp; Architecture</title>
    <link>https://community.databricks.com/t5/administration-architecture/databricks-on-gcp-with-gke-cluster-stuck-in-starting-status-gke/m-p/110492#M3018</link>
    <description>&lt;P&gt;I am having similar issues. This is the first time I am using the `databricks_cluster` resource; my terraform apply does not complete gracefully, and I see numerous errors about:&lt;/P&gt;&lt;P&gt;1.&amp;nbsp;&lt;SPAN&gt;Can’t scale up a node pool because of a failing scheduling predicate&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The autoscaler was &lt;STRONG&gt;waiting for an ephemeral volume controller to create a PersistentVolumeClaim (PVC)&lt;/STRONG&gt; before it could schedule the pod.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;This is happening on an&amp;nbsp;executor pod.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;2.&amp;nbsp;&lt;SPAN&gt;Pod is blocking scale down because it doesn’t have enough Pod Disruption Budget (PDB)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;This one is more of a minor issue, though.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Thanks in advance.&lt;/P&gt;</description>
    <pubDate>Tue, 18 Feb 2025 14:25:28 GMT</pubDate>
    <dc:creator>chalkboardbrad</dc:creator>
    <dc:date>2025-02-18T14:25:28Z</dc:date>
    <item>
      <title>Databricks on GCP with GKE | Cluster stuck in starting status | GKE allocation resource failing</title>
      <link>https://community.databricks.com/t5/administration-architecture/databricks-on-gcp-with-gke-cluster-stuck-in-starting-status-gke/m-p/110198#M2994</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Hi Databricks Community,&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I’m currently facing several challenges with my Databricks clusters running on Google Kubernetes Engine (GKE). I hope someone here might have insights or suggestions to resolve the issues.&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;Problem Overview:&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;I am experiencing frequent &lt;STRONG&gt;scaling failures&lt;/STRONG&gt; and &lt;STRONG&gt;network issues&lt;/STRONG&gt; in my GKE cluster, which is affecting my Databricks environment. These issues started happening recently, and I’ve identified multiple related problems that are hindering performance.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;Key Issues:&lt;/STRONG&gt;&lt;/H3&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Scaling Failures:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;I’ve noticed that the &lt;STRONG&gt;Cluster Autoscaler API&lt;/STRONG&gt; has been throwing &lt;STRONG&gt;"no.scale.up.mig.failing.predicate"&lt;/STRONG&gt; errors, meaning it is unable to scale up node pools properly. The logs indicate that nodes don’t meet the &lt;STRONG&gt;node affinity&lt;/STRONG&gt; rules set for the pods, resulting in unscheduled pods and scaling failures. The error message I see is:&lt;UL&gt;&lt;LI&gt;"Node(s) didn't match Pod's node affinity/selector".&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;The scaling failures involve multiple &lt;STRONG&gt;Managed Instance Groups (MIGs)&lt;/STRONG&gt; across various zones, such as europe-west1-d, europe-west1-b, and europe-west1-c.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Frequent Kubelet Restarts:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;I’m also facing an issue with &lt;STRONG&gt;Frequent Kubelet Restarts&lt;/STRONG&gt;, which seems to be leading to instability in the cluster. 
This is resulting in node disruption, further affecting the scaling process and causing intermittent downtime.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Network Issues:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The network within my GKE cluster is not functioning properly. I’m seeing errors related to the &lt;STRONG&gt;CNI plugin (Calico)&lt;/STRONG&gt;, which prevent pods from communicating properly, leading to issues with scaling, pod eviction, and overall cluster stability.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Pod Disruption Budget (PDB) Conflicts:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The current &lt;STRONG&gt;Pod Disruption Budget&lt;/STRONG&gt; settings are too restrictive, causing failures when attempting to scale down or evict pods. This is likely compounded by network issues that prevent proper pod management.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Note that I can't access the GKE cluster directly with&amp;nbsp;kubectl&amp;nbsp;as it is restricted (see the FAQ in this &lt;A href="https://www.databricks.com/blog/2021/08/06/how-we-built-databricks-on-google-kubernetes-engine-gke.html" target="_self"&gt;documentation&lt;/A&gt;). Also, I am not&amp;nbsp;&lt;SPAN&gt;proficient in Kubernetes management; all of the above issues were&amp;nbsp;highlighted with the help of ChatGPT (by passing it the error logs) in an attempt to understand what was happening on the GKE cluster.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;Thank you,&lt;BR /&gt;Edouard&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Feb 2025 12:01:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/databricks-on-gcp-with-gke-cluster-stuck-in-starting-status-gke/m-p/110198#M2994</guid>
      <dc:creator>edouardtouze</dc:creator>
      <dc:date>2025-02-14T12:01:29Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks on GCP with GKE | Cluster stuck in starting status | GKE allocation resource failing</title>
      <link>https://community.databricks.com/t5/administration-architecture/databricks-on-gcp-with-gke-cluster-stuck-in-starting-status-gke/m-p/110318#M3007</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/149236"&gt;@edouardtouze&lt;/a&gt;, could you please share the cluster ID and the timestamp when the issue was observed?&lt;/P&gt;</description>
      <pubDate>Sun, 16 Feb 2025 00:49:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/databricks-on-gcp-with-gke-cluster-stuck-in-starting-status-gke/m-p/110318#M3007</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2025-02-16T00:49:46Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks on GCP with GKE | Cluster stuck in starting status | GKE allocation resource failing</title>
      <link>https://community.databricks.com/t5/administration-architecture/databricks-on-gcp-with-gke-cluster-stuck-in-starting-status-gke/m-p/110388#M3008</link>
      <description>&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/106294"&gt;@Alberto_Umana&lt;/a&gt;,&lt;BR /&gt;&lt;BR /&gt;Thank you for your reply. Here is the information you asked for:&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;Cluster ID (name):&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;db-3451421062342009-3-0414-082538-944&lt;/STRONG&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;BR /&gt;Timestamp of first observed issue:&lt;SPAN class=""&gt;&lt;STRONG&gt;&amp;nbsp;2025-02-12T18:33:31Z&lt;/STRONG&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN class=""&gt;&lt;BR /&gt;It seems that the error occurred before, but not at the same scale or frequency (see image below):&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="edouardtouze_0-1739793831720.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14898i74C47D67441215F7/image-size/medium?v=v2&amp;amp;px=400" role="button" title="edouardtouze_0-1739793831720.png" alt="edouardtouze_0-1739793831720.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Thank you&lt;/P&gt;</description>
      <pubDate>Mon, 17 Feb 2025 12:07:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/databricks-on-gcp-with-gke-cluster-stuck-in-starting-status-gke/m-p/110388#M3008</guid>
      <dc:creator>edouardtouze</dc:creator>
      <dc:date>2025-02-17T12:07:43Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks on GCP with GKE | Cluster stuck in starting status | GKE allocation resource failing</title>
      <link>https://community.databricks.com/t5/administration-architecture/databricks-on-gcp-with-gke-cluster-stuck-in-starting-status-gke/m-p/110492#M3018</link>
      <description>&lt;P&gt;I am having similar issues. This is the first time I am using the `databricks_cluster` resource; my terraform apply does not complete gracefully, and I see numerous errors about:&lt;/P&gt;&lt;P&gt;1.&amp;nbsp;&lt;SPAN&gt;Can’t scale up a node pool because of a failing scheduling predicate&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The autoscaler was &lt;STRONG&gt;waiting for an ephemeral volume controller to create a PersistentVolumeClaim (PVC)&lt;/STRONG&gt; before it could schedule the pod.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;This is happening on an&amp;nbsp;executor pod.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;2.&amp;nbsp;&lt;SPAN&gt;Pod is blocking scale down because it doesn’t have enough Pod Disruption Budget (PDB)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;This one is more of a minor issue, though.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Tue, 18 Feb 2025 14:25:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/databricks-on-gcp-with-gke-cluster-stuck-in-starting-status-gke/m-p/110492#M3018</guid>
      <dc:creator>chalkboardbrad</dc:creator>
      <dc:date>2025-02-18T14:25:28Z</dc:date>
    </item>
  </channel>
</rss>

