<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive in Administration &amp; Architecture</title>
    <link>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104148#M2669</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/106294"&gt;@Alberto_Umana&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Here are the logs:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Standard Output:&lt;/STRONG&gt;&amp;nbsp;&lt;A href="https://gist.github.com/ambigus9/c4c17ef936a2c5fb077e26b84498b50a" target="_blank"&gt;https://gist.github.com/ambigus9/c4c17ef936a2c5fb077e26b84498b50a&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Standard Error:&lt;/STRONG&gt;&amp;nbsp;&lt;A href="https://gist.github.com/ambigus9/b5ef9b8ef3171189e21efd659c67d2bd" target="_blank"&gt;https://gist.github.com/ambigus9/b5ef9b8ef3171189e21efd659c67d2bd&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Log4j Output:&lt;/STRONG&gt;&amp;nbsp;&lt;A href="https://gist.github.com/ambigus9/9911fd669d7ea914534c3a1d0cfd8dab" target="_blank"&gt;https://gist.github.com/ambigus9/9911fd669d7ea914534c3a1d0cfd8dab&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The status of the launched EC2 VMs looks fine:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_0-1735933788820.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13865iDD08CA9D72053230/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_0-1735933788820.png" alt="ambigus9_0-1735933788820.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 03 Jan 2025 19:50:05 GMT</pubDate>
    <dc:creator>ambigus9</dc:creator>
    <dc:date>2025-01-03T19:50:05Z</dc:date>
    <item>
      <title>Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive</title>
      <link>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104081#M2660</link>
      <description>&lt;P&gt;I am currently trying to create a compute cluster on a workspace with PrivateLink and a custom VPC.&lt;/P&gt;&lt;P&gt;I'm using Terraform:&amp;nbsp;&lt;A href="https://registry.terraform.io/providers/databricks/databricks/latest/docs/guides/aws-private-link-workspace" target="_blank"&gt;https://registry.terraform.io/providers/databricks/databricks/latest/docs/guides/aws-private-link-workspace&lt;/A&gt;&lt;/P&gt;&lt;P&gt;After the deployment completes, I try to create a compute cluster, but I get the following error:&lt;/P&gt;&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Spark driver became unresponsive on startup. This issue can be caused by invalid Spark configurations or malfunctioning init scripts. Please refer to the Spark driver logs to troubleshoot this issue, and contact Databricks if the problem persists.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Internal error message: Spark failed to start: Driver unresponsive. 
Possible reasons: library conflicts, incorrect metastore configuration, and init script misconfiguration.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;I tried everything, including creating the S3 Gateway Endpoint, the STS Interface Endpoint, and the Kinesis Streams Interface Endpoint:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_0-1735912629708.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13844iA3FEC2239640D090/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_0-1735912629708.png" alt="ambigus9_0-1735912629708.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Also, the Security Group has the corresponding ports in its inbound and outbound rules:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Security Group - Network Workspace - Inbound Rules&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_1-1735912708564.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13845i77ACC0D53425423C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_1-1735912708564.png" alt="ambigus9_1-1735912708564.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Security Group - Network Workspace - Outbound Rules&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_2-1735912741139.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13846i4FB68A34ECF0873E/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_2-1735912741139.png" alt="ambigus9_2-1735912741139.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Any help will be appreciated. Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 03 Jan 2025 14:01:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104081#M2660</guid>
      <dc:creator>ambigus9</dc:creator>
      <dc:date>2025-01-03T14:01:43Z</dc:date>
    </item>
    <item>
      <title>Re: Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive</title>
      <link>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104082#M2661</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/133005"&gt;@ambigus9&lt;/a&gt;,&lt;/P&gt;
&lt;P class="p1"&gt;It seems like you are encountering issues with creating a compute cluster in a Databricks workspace configured with PrivateLink and a custom VPC using Terraform. The error message indicates that the Spark driver is becoming unresponsive on startup, which could be due to several reasons such as invalid Spark configurations, library conflicts, incorrect metastore configuration, or misconfigured init scripts.&lt;/P&gt;
&lt;P class="p1"&gt;Here are some steps you can take to troubleshoot and resolve the issue:&lt;/P&gt;
&lt;OL class="ol1"&gt;
&lt;LI class="li1"&gt;&lt;STRONG&gt;Check Spark Configurations and Init Scripts&lt;/STRONG&gt;:&lt;/LI&gt;
&lt;UL class="ul1"&gt;
&lt;LI class="li1"&gt;Review the Spark configurations and ensure they are correctly set up. Invalid configurations can cause the driver to become unresponsive.&lt;/LI&gt;
&lt;LI class="li1"&gt;Verify that the init scripts are correctly configured and do not contain errors that could prevent the Spark driver from starting.&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI class="li1"&gt;&lt;STRONG&gt;Review Security Group Rules&lt;/STRONG&gt;:&lt;/LI&gt;
&lt;UL class="ul1"&gt;
&lt;LI class="li1"&gt;Ensure that the security group rules for both inbound and outbound traffic are correctly configured. The necessary ports (443, 2443, 6666, 8443, 8444, 8445-8451) should be open as required by Databricks.&lt;/LI&gt;
&lt;LI class="li1"&gt;Make sure that the security group allows traffic between the workspace subnets and the VPC endpoints.&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI class="li1"&gt;&lt;STRONG&gt;Check VPC Endpoints&lt;/STRONG&gt;:&lt;/LI&gt;
&lt;UL class="ul1"&gt;
&lt;LI class="li1"&gt;Verify that the VPC endpoints for the workspace and secure cluster connectivity relay are correctly set up and associated with the appropriate subnets and security groups.&lt;/LI&gt;
&lt;LI class="li1"&gt;Ensure that the DNS hostnames and DNS resolution are enabled for the VPC.&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI class="li1"&gt;&lt;STRONG&gt;Review Network ACLs&lt;/STRONG&gt;:&lt;/LI&gt;
&lt;UL class="ul1"&gt;
&lt;LI class="li1"&gt;Ensure that the network ACLs for the subnets allow bidirectional (outbound and inbound) rules for the necessary ports.&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI class="li1"&gt;&lt;STRONG&gt;Check AWS Service Endpoints&lt;/STRONG&gt;:&lt;/LI&gt;
&lt;UL class="ul1"&gt;
&lt;LI class="li1"&gt;Ensure that the necessary AWS service endpoints (S3, STS, Kinesis) are correctly set up and accessible from the workspace subnets.&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI class="li1"&gt;&lt;STRONG&gt;Review Spark Driver Logs&lt;/STRONG&gt;:&lt;/LI&gt;
&lt;UL class="ul1"&gt;
&lt;LI class="li1"&gt;Access the Spark driver logs to get more detailed information about the error. The logs can provide insights into what might be causing the driver to become unresponsive.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/OL&gt;</description>
      <pubDate>Fri, 03 Jan 2025 14:04:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104082#M2661</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2025-01-03T14:04:54Z</dc:date>
    </item>
    <item>
      <title>Re: Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive</title>
      <link>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104090#M2662</link>
      <description>&lt;P&gt;Thanks for your quick answer! I'm curious about the Security Groups.&lt;/P&gt;&lt;P&gt;There are two Security Groups: one that I must create and pass to Terraform, and a second one that is created by Terraform with the description&amp;nbsp;&lt;EM&gt;Data Plane VPC endpoint security group&lt;/EM&gt;.&lt;/P&gt;&lt;P&gt;1) Which one must have ports 443, 2443, 6666, 8443, 8444, 8445-8451 open?&lt;/P&gt;&lt;P&gt;2) What should the destination be?&lt;/P&gt;&lt;P&gt;I have this configuration:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Data Plane VPC endpoint security group&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_0-1735913915707.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13847i24FC7762867ED411/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_0-1735913915707.png" alt="ambigus9_0-1735913915707.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Security Group Created for Databricks Network (Workspace)&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_1-1735914100177.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13848i576E5B4268C9C63F/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_1-1735914100177.png" alt="ambigus9_1-1735914100177.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Are they correctly configured?&lt;/P&gt;</description>
      <pubDate>Fri, 03 Jan 2025 14:25:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104090#M2662</guid>
      <dc:creator>ambigus9</dc:creator>
      <dc:date>2025-01-03T14:25:12Z</dc:date>
    </item>
    <item>
      <title>Re: Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive</title>
      <link>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104093#M2663</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/133005"&gt;@ambigus9&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Have you reviewed the driver logs of the cluster? They would give us a clue about the root cause of the issue.&lt;/P&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;SPAN&gt;1) The security group that must have ports 443, 2443, 6666, 8443, 8444, 8445-8451 open is the one created by Terraform, which is described as the "Data Plane VPC endpoint security group."&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;SPAN&gt;2) The destination for these ports should be 0.0.0.0/0, which allows traffic to any destination. This is necessary for the Databricks infrastructure, cloud data sources, library repositories, secure cluster connectivity, and other internal Databricks services.&lt;/SPAN&gt;&lt;/P&gt;
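&lt;P&gt;As a rough aid for comparing the port list above with the rules shown in the screenshots, here is a minimal Python sketch. It assumes the security group rules have been copied by hand into inclusive (start, end) port pairs; the names &lt;CODE&gt;REQUIRED_PORT_RANGES&lt;/CODE&gt; and &lt;CODE&gt;missing_ports&lt;/CODE&gt; are hypothetical and are not part of any Databricks or AWS API.&lt;/P&gt;
&lt;PRE&gt;
```python
# Ports listed in the reply above, as inclusive (start, end) pairs.
REQUIRED_PORT_RANGES = [
    (443, 443),
    (2443, 2443),
    (6666, 6666),
    (8443, 8443),
    (8444, 8444),
    (8445, 8451),
]


def missing_ports(allowed_ranges, required_ranges=REQUIRED_PORT_RANGES):
    """Return the required ports that no allowed (start, end) range covers."""
    missing = []
    for start, end in required_ranges:
        for port in range(start, end + 1):
            # A port is covered when it falls inside at least one allowed range.
            covered = any(port in range(lo, hi + 1) for lo, hi in allowed_ranges)
            if not covered:
                missing.append(port)
    return missing
```
&lt;/PRE&gt;
&lt;P&gt;For example, &lt;CODE&gt;missing_ports([(443, 443), (8443, 8451)])&lt;/CODE&gt; returns &lt;CODE&gt;[2443, 6666]&lt;/CODE&gt;, meaning those two ports would still need a rule.&lt;/P&gt;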
&lt;P&gt;&lt;A href="https://docs.databricks.com/en/security/network/classic/privatelink.html" target="_blank"&gt;https://docs.databricks.com/en/security/network/classic/privatelink.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 03 Jan 2025 15:08:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104093#M2663</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2025-01-03T15:08:23Z</dc:date>
    </item>
    <item>
      <title>Re: Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive</title>
      <link>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104113#M2665</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/106294"&gt;@Alberto_Umana&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Data Plane VPC endpoint Security Group - Inbound Rules&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_1-1735923037660.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13854iAF5D21CEF78304B8/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_1-1735923037660.png" alt="ambigus9_1-1735923037660.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Data Plane VPC endpoint Security Group - Outbound Rules&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_0-1735922985738.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13853i38243F8592A37C80/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_0-1735922985738.png" alt="ambigus9_0-1735922985738.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Security Group Workspaces Network - Inbound Rules&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_2-1735923149645.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13855iC8BAF1FC743D4D1D/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_2-1735923149645.png" alt="ambigus9_2-1735923149645.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Security Group Workspaces Network - Outbound Rules&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_3-1735923195980.png" style="width: 400px;"&gt;&lt;img 
src="https://community.databricks.com/t5/image/serverpage/image-id/13856iCC854FBDD8F53D23/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_3-1735923195980.png" alt="ambigus9_3-1735923195980.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;1) Are these correctly configured?&lt;/P&gt;&lt;P&gt;2) I'm curious about the fact that the EC2 workers use the&amp;nbsp;&lt;EM&gt;Security Group Workspaces Network&lt;/EM&gt;, as you can see in the following image:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_4-1735923329535.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13857iC5CEC9DF6B7B35E4/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_4-1735923329535.png" alt="ambigus9_4-1735923329535.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 03 Jan 2025 16:57:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104113#M2665</guid>
      <dc:creator>ambigus9</dc:creator>
      <dc:date>2025-01-03T16:57:19Z</dc:date>
    </item>
    <item>
      <title>Re: Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive</title>
      <link>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104126#M2667</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/133005"&gt;@ambigus9&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Yes, the configuration looks fine, and I can see the workerenv running. What failure do you see in the driver logs while launching a cluster?&lt;/P&gt;
&lt;P&gt;Also, what is the status of the EC2 VMs launched when the cluster is spun up?&lt;/P&gt;</description>
      <pubDate>Fri, 03 Jan 2025 17:42:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104126#M2667</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2025-01-03T17:42:39Z</dc:date>
    </item>
    <item>
      <title>Re: Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive</title>
      <link>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104148#M2669</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/106294"&gt;@Alberto_Umana&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Here are the logs:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Standard Output:&lt;/STRONG&gt;&amp;nbsp;&lt;A href="https://gist.github.com/ambigus9/c4c17ef936a2c5fb077e26b84498b50a" target="_blank"&gt;https://gist.github.com/ambigus9/c4c17ef936a2c5fb077e26b84498b50a&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Standard Error:&lt;/STRONG&gt;&amp;nbsp;&lt;A href="https://gist.github.com/ambigus9/b5ef9b8ef3171189e21efd659c67d2bd" target="_blank"&gt;https://gist.github.com/ambigus9/b5ef9b8ef3171189e21efd659c67d2bd&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Log4j Output:&lt;/STRONG&gt;&amp;nbsp;&lt;A href="https://gist.github.com/ambigus9/9911fd669d7ea914534c3a1d0cfd8dab" target="_blank"&gt;https://gist.github.com/ambigus9/9911fd669d7ea914534c3a1d0cfd8dab&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The status of the launched EC2 VMs looks fine:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_0-1735933788820.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13865iDD08CA9D72053230/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_0-1735933788820.png" alt="ambigus9_0-1735933788820.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 03 Jan 2025 19:50:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104148#M2669</guid>
      <dc:creator>ambigus9</dc:creator>
      <dc:date>2025-01-03T19:50:05Z</dc:date>
    </item>
    <item>
      <title>Re: Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive</title>
      <link>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104153#M2670</link>
      <description>&lt;P&gt;Thanks for the details.&lt;/P&gt;
&lt;P&gt;Can you ensure that the network connection to the metastore database is stable and that no firewall rules or security groups are blocking access to the database? You can use the &lt;CODE&gt;nc&lt;/CODE&gt; command to verify connectivity to the database host and port.&lt;/P&gt;
&lt;P&gt;You can see the failure in this log entry:&lt;/P&gt;
&lt;P class="p1"&gt;25/01/03 18:59:40 WARN MetastoreMonitor: Failed to connect to the metastore InternalMysqlMetastore(DbMetastoreConfig{host=mdpartyyphlhsp.caj77bnxuhme.us-west-2.rds.amazonaws.com, port=3306, dbName=organization2149045078433955, user=f7tWV573MJqOHYAs}). (timeSinceLastSuccess=0)&lt;/P&gt;
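&lt;P&gt;As a scripted alternative to a manual check, the host and port from the warning above can be probed with a plain TCP connection. The following is a minimal Python sketch using only the standard library &lt;CODE&gt;socket&lt;/CODE&gt; module; the function name &lt;CODE&gt;can_connect&lt;/CODE&gt; and the 5-second timeout are assumptions for illustration. It only confirms that the TCP handshake succeeds, not that the metastore accepts the login.&lt;/P&gt;
&lt;PRE&gt;
```python
import socket


def can_connect(host, port, timeout_seconds=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False
```
&lt;/PRE&gt;
&lt;P&gt;Run from a notebook cell on the cluster, &lt;CODE&gt;can_connect("mdpartyyphlhsp.caj77bnxuhme.us-west-2.rds.amazonaws.com", 3306)&lt;/CODE&gt; returning &lt;CODE&gt;False&lt;/CODE&gt; would point to a DNS, routing, security group, or firewall problem between the workspace subnets and the metastore.&lt;/P&gt;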
&lt;P class="p1"&gt;From a notebook, you can run the &lt;CODE&gt;nc&lt;/CODE&gt; command against the RDS host and port shown above.&lt;/P&gt;</description>
      <pubDate>Fri, 03 Jan 2025 21:42:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104153#M2670</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2025-01-03T21:42:12Z</dc:date>
    </item>
    <item>
      <title>Re: Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive</title>
      <link>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104523#M2697</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/106294"&gt;@Alberto_Umana&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;After running the command, I get a connection timeout:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_0-1736259265678.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13908i42B9178D4D1E7196/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_0-1736259265678.png" alt="ambigus9_0-1736259265678.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;It is curious that the cluster shows a green status with the following logs:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_1-1736259322653.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13909iE0CDCDF120C66178/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_1-1736259322653.png" alt="ambigus9_1-1736259322653.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Security Group Workspaces Network - Inbound Rules&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_2-1736260250583.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13910iEA4A73FBDEE3F0C0/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_2-1736260250583.png" alt="ambigus9_2-1736260250583.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Security Group Workspaces Network - Outbound Rules&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_3-1736260295109.png" style="width: 400px;"&gt;&lt;img 
src="https://community.databricks.com/t5/image/serverpage/image-id/13911iB24F6B351D005196/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_3-1736260295109.png" alt="ambigus9_3-1736260295109.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jan 2025 14:33:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104523#M2697</guid>
      <dc:creator>ambigus9</dc:creator>
      <dc:date>2025-01-07T14:33:40Z</dc:date>
    </item>
    <item>
      <title>Re: Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive</title>
      <link>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104525#M2698</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/133005"&gt;@ambigus9&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Based on the connectivity test, the connection to the RDS instance is not working. Can you check whether a firewall is blocking the request, since the connection is not reaching the RDS instance?&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jan 2025 14:38:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104525#M2698</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2025-01-07T14:38:24Z</dc:date>
    </item>
    <item>
      <title>Re: Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive</title>
      <link>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104562#M2700</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/106294"&gt;@Alberto_Umana&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;I would like to share the VPC resource map with you:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_0-1736264798342.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13914iE56D0A70E65EDB25/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_0-1736264798342.png" alt="ambigus9_0-1736264798342.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I am using&amp;nbsp;&lt;STRONG&gt;app-private-datalake-subnet-a1&lt;/STRONG&gt;&amp;nbsp;and&amp;nbsp;&lt;STRONG&gt;app-private-datalake-subnet-b1&lt;/STRONG&gt; to deploy the Workspace. Also, the subnet dedicated to the VPC endpoints is&amp;nbsp;&lt;STRONG&gt;uat-datalake-vpc-0a448f9e2a1b0ef4e-pl-vpce&lt;/STRONG&gt;. Is that OK?&lt;/P&gt;&lt;P&gt;It is important to note that this is a custom VPC, and it doesn't have a NAT Gateway associated; it uses a Transit Gateway instead. 
Here is the configuration of the subnets:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;app-private-datalake-subnet-a1 - Route Table&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_1-1736265081748.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13915iFA95D560E1D9D4AF/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_1-1736265081748.png" alt="ambigus9_1-1736265081748.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;app-private-datalake-subnet-a1 - Network ACL&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_3-1736265162236.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13917i8506C538EAD5016E/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_3-1736265162236.png" alt="ambigus9_3-1736265162236.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;app-private-datalake-subnet-b1 - Route Table&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_2-1736265110907.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13916iD75D0430F4106E7E/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_2-1736265110907.png" alt="ambigus9_2-1736265110907.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_4-1736265204915.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13918iD44B5234566F8274/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_4-1736265204915.png" alt="ambigus9_4-1736265204915.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;It is really frustrating that I am once again getting the same error:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_5-1736265338123.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13919i14C5531E0D58380B/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_5-1736265338123.png" alt="ambigus9_5-1736265338123.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_6-1736265355716.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13920i96C31535D2F6AAC7/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_6-1736265355716.png" alt="ambigus9_6-1736265355716.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Security Group - datalake-sg-workspace - Inbound Rules&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_7-1736265438145.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13921i4381E9811E7EAF70/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_7-1736265438145.png" alt="ambigus9_7-1736265438145.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Security Group - Data Plane VPC endpoint security group - Inbound Rules&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ambigus9_8-1736265510669.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13922i5E5D8B1B38AD123D/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ambigus9_8-1736265510669.png" alt="ambigus9_8-1736265510669.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Any idea what could be causing the error to occur again?&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jan 2025 16:00:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/failed-to-add-3-workers-to-the-compute-will-attempt-retry-true/m-p/104562#M2700</guid>
      <dc:creator>ambigus9</dc:creator>
      <dc:date>2025-01-07T16:00:31Z</dc:date>
    </item>
  </channel>
</rss>

