<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Bootstrap cluster timeout for job pipeline - databricks bug? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/bootstrap-cluster-timeout-for-job-pipeline-databricks-bug/m-p/106424#M42485</link>
    <description>&lt;P&gt;From time to time we get these errors in scheduled PROD runs. They occur when a job starts and tries to create its one-time job cluster. This happens in roughly 1 out of every 10-20 runs, and we have not been able to identify the root cause: all network connectivity is fine, and other jobs run fine at the same time. Why does this happen? Could it be a bug in Databricks cluster creation?&lt;/P&gt;&lt;P&gt;The job's one-time cluster config:&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;&lt;SPAN&gt;job_clusters&lt;/SPAN&gt;:&lt;BR /&gt;  - &lt;SPAN&gt;job_cluster_key&lt;/SPAN&gt;: my_cluster&lt;BR /&gt;    &lt;SPAN&gt;new_cluster&lt;/SPAN&gt;:&lt;BR /&gt;      &lt;SPAN&gt;cluster_name&lt;/SPAN&gt;: &lt;SPAN&gt;""&lt;BR /&gt;&lt;/SPAN&gt;      &lt;SPAN&gt;spark_version&lt;/SPAN&gt;: 15.4.x-scala2.12&lt;BR /&gt;      &lt;SPAN&gt;spark_conf&lt;/SPAN&gt;:&lt;BR /&gt;        &lt;SPAN&gt;spark.databricks.cluster.profile&lt;/SPAN&gt;: singleNode&lt;BR /&gt;        &lt;SPAN&gt;spark.master&lt;/SPAN&gt;: local[*, 4]&lt;BR /&gt;      &lt;SPAN&gt;aws_attributes&lt;/SPAN&gt;:&lt;BR /&gt;        &lt;SPAN&gt;first_on_demand&lt;/SPAN&gt;: 1&lt;BR /&gt;        &lt;SPAN&gt;availability&lt;/SPAN&gt;: SPOT_WITH_FALLBACK&lt;BR /&gt;        &lt;SPAN&gt;zone_id&lt;/SPAN&gt;: eu-west-1b&lt;BR /&gt;        &lt;SPAN&gt;spot_bid_price_percent&lt;/SPAN&gt;: 100&lt;BR /&gt;      &lt;SPAN&gt;node_type_id&lt;/SPAN&gt;: m5d.4xlarge&lt;BR /&gt;      &lt;SPAN&gt;driver_node_type_id&lt;/SPAN&gt;: m5d.4xlarge&lt;BR /&gt;      &lt;SPAN&gt;custom_tags&lt;/SPAN&gt;:&lt;BR /&gt;        &lt;SPAN&gt;ResourceClass&lt;/SPAN&gt;: SingleNode&lt;BR /&gt;      &lt;SPAN&gt;enable_elastic_disk&lt;/SPAN&gt;: true&lt;BR /&gt;      &lt;SPAN&gt;data_security_mode&lt;/SPAN&gt;: SINGLE_USER&lt;BR /&gt;      &lt;SPAN&gt;runtime_engine&lt;/SPAN&gt;: STANDARD&lt;BR /&gt;      &lt;SPAN&gt;num_workers&lt;/SPAN&gt;: 0&lt;/PRE&gt;&lt;P&gt;The error we get impacts our PROD runs and is really annoying:&lt;/P&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;run failed with error message
 Cluster '0120-205753-51ldqtu1' was terminated. Reason: BOOTSTRAP_TIMEOUT (SERVICE_FAULT). Parameters: databricks_error_message:[id: InstanceId(i-0a8e2c9776c79e66d), status: INSTANCE_INITIALIZING, workerEnvId:WorkerEnvId(workerenv-3386680009775160-1371cfd7-90c5-4a02-84fd-eedf9d7fa269), lastStatusChangeTime: 1737406707396, groupIdOpt Some(0),requestIdOpt Some(0120-205753-51ldqtu1-a444ac40-a32d-4c63-b),version 1] with threshold 700 seconds timed out after 703368 milliseconds. Instance bootstrap inferred timeout reason: UnknownReason. Please check network connectivity from the data plane to the control plane., instance_id:i-0a8e2c9776c79e66d.&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What does "UnknownReason" mean here?&lt;/P&gt;&lt;P&gt;Do you know how we could fix this? All network connectivity is fine.&lt;/P&gt;</description>
    <pubDate>Tue, 21 Jan 2025 09:24:11 GMT</pubDate>
    <dc:creator>drag7ter</dc:creator>
    <dc:date>2025-01-21T09:24:11Z</dc:date>
    <item>
      <title>Bootstrap cluster timeout for job pipeline - databricks bug?</title>
      <link>https://community.databricks.com/t5/data-engineering/bootstrap-cluster-timeout-for-job-pipeline-databricks-bug/m-p/106424#M42485</link>
      <description>&lt;P&gt;From time to time we get these errors in scheduled PROD runs. They occur when a job starts and tries to create its one-time job cluster. This happens in roughly 1 out of every 10-20 runs, and we have not been able to identify the root cause: all network connectivity is fine, and other jobs run fine at the same time. Why does this happen? Could it be a bug in Databricks cluster creation?&lt;/P&gt;&lt;P&gt;The job's one-time cluster config:&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;&lt;SPAN&gt;job_clusters&lt;/SPAN&gt;:&lt;BR /&gt;  - &lt;SPAN&gt;job_cluster_key&lt;/SPAN&gt;: my_cluster&lt;BR /&gt;    &lt;SPAN&gt;new_cluster&lt;/SPAN&gt;:&lt;BR /&gt;      &lt;SPAN&gt;cluster_name&lt;/SPAN&gt;: &lt;SPAN&gt;""&lt;BR /&gt;&lt;/SPAN&gt;      &lt;SPAN&gt;spark_version&lt;/SPAN&gt;: 15.4.x-scala2.12&lt;BR /&gt;      &lt;SPAN&gt;spark_conf&lt;/SPAN&gt;:&lt;BR /&gt;        &lt;SPAN&gt;spark.databricks.cluster.profile&lt;/SPAN&gt;: singleNode&lt;BR /&gt;        &lt;SPAN&gt;spark.master&lt;/SPAN&gt;: local[*, 4]&lt;BR /&gt;      &lt;SPAN&gt;aws_attributes&lt;/SPAN&gt;:&lt;BR /&gt;        &lt;SPAN&gt;first_on_demand&lt;/SPAN&gt;: 1&lt;BR /&gt;        &lt;SPAN&gt;availability&lt;/SPAN&gt;: SPOT_WITH_FALLBACK&lt;BR /&gt;        &lt;SPAN&gt;zone_id&lt;/SPAN&gt;: eu-west-1b&lt;BR /&gt;        &lt;SPAN&gt;spot_bid_price_percent&lt;/SPAN&gt;: 100&lt;BR /&gt;      &lt;SPAN&gt;node_type_id&lt;/SPAN&gt;: m5d.4xlarge&lt;BR /&gt;      &lt;SPAN&gt;driver_node_type_id&lt;/SPAN&gt;: m5d.4xlarge&lt;BR /&gt;      &lt;SPAN&gt;custom_tags&lt;/SPAN&gt;:&lt;BR /&gt;        &lt;SPAN&gt;ResourceClass&lt;/SPAN&gt;: SingleNode&lt;BR /&gt;      &lt;SPAN&gt;enable_elastic_disk&lt;/SPAN&gt;: true&lt;BR /&gt;      &lt;SPAN&gt;data_security_mode&lt;/SPAN&gt;: SINGLE_USER&lt;BR /&gt;      &lt;SPAN&gt;runtime_engine&lt;/SPAN&gt;: STANDARD&lt;BR /&gt;      &lt;SPAN&gt;num_workers&lt;/SPAN&gt;: 0&lt;/PRE&gt;&lt;P&gt;The error we get impacts our PROD runs and is really annoying:&lt;/P&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;run failed with error message
 Cluster '0120-205753-51ldqtu1' was terminated. Reason: BOOTSTRAP_TIMEOUT (SERVICE_FAULT). Parameters: databricks_error_message:[id: InstanceId(i-0a8e2c9776c79e66d), status: INSTANCE_INITIALIZING, workerEnvId:WorkerEnvId(workerenv-3386680009775160-1371cfd7-90c5-4a02-84fd-eedf9d7fa269), lastStatusChangeTime: 1737406707396, groupIdOpt Some(0),requestIdOpt Some(0120-205753-51ldqtu1-a444ac40-a32d-4c63-b),version 1] with threshold 700 seconds timed out after 703368 milliseconds. Instance bootstrap inferred timeout reason: UnknownReason. Please check network connectivity from the data plane to the control plane., instance_id:i-0a8e2c9776c79e66d.&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What does "UnknownReason" mean here?&lt;/P&gt;&lt;P&gt;Do you know how we could fix this? All network connectivity is fine.&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jan 2025 09:24:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bootstrap-cluster-timeout-for-job-pipeline-databricks-bug/m-p/106424#M42485</guid>
      <dc:creator>drag7ter</dc:creator>
      <dc:date>2025-01-21T09:24:11Z</dc:date>
    </item>
    <item>
      <title>Re: Bootstrap cluster timeout for job pipeline - databricks bug?</title>
      <link>https://community.databricks.com/t5/data-engineering/bootstrap-cluster-timeout-for-job-pipeline-databricks-bug/m-p/108253#M43007</link>
      <description>&lt;P&gt;The termination reason "BOOTSTRAP_TIMEOUT (SERVICE_FAULT)" means that the cluster was terminated because its instance took too long to initialize: in your message the instance was still in INSTANCE_INITIALIZING when the 700-second threshold expired. This can have several causes, including network connectivity issues between the data plane and the control plane, or problems in the cloud provider's infrastructure.&lt;/P&gt;
&lt;P&gt;Given the intermittent nature of the issue (1 in 10-20 runs), it might be challenging to pinpoint the exact cause. Monitoring the infrastructure and keeping track of when the errors occur can help identify any patterns or recurring issues.&lt;/P&gt;
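&lt;P&gt;As a minimal sketch of that kind of tracking, the Python snippet below pulls the main fields out of a termination message like the one you posted and converts lastStatusChangeTime (epoch milliseconds) to UTC. The helper name parse_termination and the regular expressions are my own assumptions based on the text of that one message, not a Databricks API, and the message is abridged here.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;
import re
from datetime import datetime, timezone

# Termination message from the failed run in this thread, abridged for brevity.
MESSAGE = (
    "Cluster '0120-205753-51ldqtu1' was terminated. "
    "Reason: BOOTSTRAP_TIMEOUT (SERVICE_FAULT). Parameters: "
    "databricks_error_message:[id: InstanceId(i-0a8e2c9776c79e66d), "
    "status: INSTANCE_INITIALIZING, lastStatusChangeTime: 1737406707396] "
    "with threshold 700 seconds timed out after 703368 milliseconds. "
    "Instance bootstrap inferred timeout reason: UnknownReason."
)


def parse_termination(message):
    """Hypothetical helper: extract fields worth logging for each failed run."""

    def first(pattern):
        # Return the first capture group, or None if the pattern is absent.
        match = re.search(pattern, message)
        return match.group(1) if match else None

    changed_ms = first(r"lastStatusChangeTime: (\d+)")
    elapsed_ms = first(r"timed out after (\d+) milliseconds")
    return {
        "cluster_id": first(r"Cluster '([^']+)'"),
        "reason": first(r"Reason: (\w+)"),
        "instance_id": first(r"InstanceId\(([^)]+)\)"),
        "status": first(r"status: (\w+)"),
        "inferred_reason": first(r"inferred timeout reason: (\w+)"),
        "elapsed_ms": int(elapsed_ms) if elapsed_ms else None,
        # Epoch milliseconds converted to a timezone-aware UTC datetime.
        "status_changed_utc": (
            datetime.fromtimestamp(int(changed_ms) / 1000, tz=timezone.utc)
            if changed_ms
            else None
        ),
    }


if __name__ == "__main__":
    for key, value in parse_termination(MESSAGE).items():
        print(key, value)
&lt;/LI-CODE&gt;
&lt;P&gt;Recording these fields for every failed run should make it easier to see whether the failures line up by time of day or by instance.&lt;/P&gt;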
&lt;P&gt;I would suggest debugging the issue as soon as it occurs: check all the available logs for the failed cluster, and also check on the cloud provider side.&lt;/P&gt;</description>
      <pubDate>Sat, 01 Feb 2025 05:23:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bootstrap-cluster-timeout-for-job-pipeline-databricks-bug/m-p/108253#M43007</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-02-01T05:23:19Z</dc:date>
    </item>
  </channel>
</rss>

