<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Intermittent task execution issues in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/intermittent-task-execution-issues/m-p/144209#M52284</link>
    <description>&lt;P&gt;We're getting intermittent errors:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;[ISOLATION_STARTUP_FAILURE.SANDBOX_STARTUP] Failed to start isolated execution environment. Sandbox startup failed.
Exception class: INTERNAL.
Exception message: INTERNAL: LaunchSandboxRequest create failed - Error executing LivenessCheckStep: failed to perform livenessCommand for container [REDACTED] with commands sh -c (grep -qE ':[0]*1F40' /proc/net/tcp) || (grep -qE ':[0]*1F40' /proc/net/tcp6) || (echo "Error: No process listening on port 8000" &amp;amp;&amp;amp; exit 1) and error max deadline has passed, failed to perform livenessCommand for container [REDACTED] with error , cpu.stat: NrPeriods = 0,  NrThrottled = 0, ThrottledTime = 0.
Last sandbox stdout: .
Last sandbox stderr: .
Please contact Databricks support. SQLSTATE: XXKSS&lt;/LI-CODE&gt;&lt;P&gt;These take several minutes to "complete" (i.e. fail) and retrying seems to repeat the issue. This is just one of the ways we need to babysit our ETL jobs every now and then. This is on serverless compute, but it can happen on other types of compute as well.&lt;/P&gt;&lt;P&gt;Is Databricks aware of these issues and monitoring this?&lt;/P&gt;</description>
    <pubDate>Fri, 16 Jan 2026 06:38:44 GMT</pubDate>
    <dc:creator>Malthe</dc:creator>
    <dc:date>2026-01-16T06:38:44Z</dc:date>
    <item>
      <title>Intermittent task execution issues</title>
      <link>https://community.databricks.com/t5/data-engineering/intermittent-task-execution-issues/m-p/144209#M52284</link>
      <description>&lt;P&gt;We're getting intermittent errors:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;[ISOLATION_STARTUP_FAILURE.SANDBOX_STARTUP] Failed to start isolated execution environment. Sandbox startup failed.
Exception class: INTERNAL.
Exception message: INTERNAL: LaunchSandboxRequest create failed - Error executing LivenessCheckStep: failed to perform livenessCommand for container [REDACTED] with commands sh -c (grep -qE ':[0]*1F40' /proc/net/tcp) || (grep -qE ':[0]*1F40' /proc/net/tcp6) || (echo "Error: No process listening on port 8000" &amp;amp;&amp;amp; exit 1) and error max deadline has passed, failed to perform livenessCommand for container [REDACTED] with error , cpu.stat: NrPeriods = 0,  NrThrottled = 0, ThrottledTime = 0.
Last sandbox stdout: .
Last sandbox stderr: .
Please contact Databricks support. SQLSTATE: XXKSS&lt;/LI-CODE&gt;&lt;P&gt;These take several minutes to "complete" (i.e. fail) and retrying seems to repeat the issue. This is just one of the ways we need to babysit our ETL jobs every now and then. This is on serverless compute, but it can happen on other types of compute as well.&lt;/P&gt;&lt;P&gt;Is Databricks aware of these issues and monitoring this?&lt;/P&gt;</description>
      <pubDate>Fri, 16 Jan 2026 06:38:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/intermittent-task-execution-issues/m-p/144209#M52284</guid>
      <dc:creator>Malthe</dc:creator>
      <dc:date>2026-01-16T06:38:44Z</dc:date>
    </item>
    <item>
      <title>Re: Intermittent task execution issues</title>
      <link>https://community.databricks.com/t5/data-engineering/intermittent-task-execution-issues/m-p/144211#M52285</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9268"&gt;@Malthe&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;This might be because of New DBR (18.0) GA release yesterday(&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/release-notes/product/2026/january" target="_blank"&gt;January 2026 - Azure Databricks | Microsoft Learn&lt;/A&gt;). you might need to use a custom spark version by the time engineering team fixes this issue in DBR. Below is the response from Databricks Support for similar sort of problem.&lt;/P&gt;&lt;P&gt;"&lt;/P&gt;&lt;P class="lia-align-left lia-indent-padding-left-30px"&gt;There was a DBR release on 7th August (14.3.10 -&amp;gt; 14.3.11).&lt;BR /&gt;Our engineering team identified the issue and the fix is scheduled to be deployed on September 16th.&lt;BR /&gt;&lt;BR /&gt;Until the fix is deployed, You can use the below custom spark image version in your cluster.&lt;BR /&gt;&lt;BR /&gt;enter the below in the&amp;nbsp;Custom Spark Version: The custom image provided was the old DBR prior to 8th August.&lt;BR /&gt;&lt;EM&gt;custom:release__14.3.x-snapshot-scala2.12__databricks-universe__14.3.10__9b6cd4f__debafb7__jenkins__1cbb705__format-3&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;"&lt;/EM&gt;&lt;BR /&gt;&lt;BR /&gt;link to instruction how to enable definition of custom spark version&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;A href="https://kb.databricks.com/en_US/clusters/run-a-custom-databricks-runtime-on-your-cluster" target="_blank"&gt;Run a custom Databricks Runtime on your cluster - Databricks&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 16 Jan 2026 07:06:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/intermittent-task-execution-issues/m-p/144211#M52285</guid>
      <dc:creator>sandy_123</dc:creator>
      <dc:date>2026-01-16T07:06:36Z</dc:date>
    </item>
    <item>
      <title>Re: Intermittent task execution issues</title>
      <link>https://community.databricks.com/t5/data-engineering/intermittent-task-execution-issues/m-p/144390#M52316</link>
      <description>&lt;P&gt;According to&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/release-notes/serverless/" target="_blank" rel="noopener"&gt;https://learn.microsoft.com/en-us/azure/databricks/release-notes/serverless/&lt;/A&gt;,&amp;nbsp;17.3 is the latest release for serverless, and we're on Serverless Environment 4.&lt;/P&gt;&lt;P&gt;Here's the traceback:&lt;/P&gt;&lt;LI-CODE lang="java"&gt;File /databricks/python/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py:2433, in SparkConnectClient._handle_rpc_error(self, rpc_error)
   2429             logger.debug(f"Received ErrorInfo: {info}")
   2431             self._handle_rpc_error_with_error_info(info, status.message, status_code)  # EDGE
-&amp;gt; 2433             raise convert_exception(
   2434                 info,
   2435                 status.message,
   2436                 self._fetch_enriched_error(info),
   2437                 self._display_server_stack_trace(),
   2438                 status_code,
   2439             ) from None
   2441     raise SparkConnectGrpcException(
   2442         message=status.message,
   2443         sql_state=ErrorCode.CLIENT_UNEXPECTED_MISSING_SQL_STATE,  # EDGE
   2444         grpc_status_code=status_code,
   2445     ) from None
   2446 else:&lt;/LI-CODE&gt;&lt;P&gt;It happened during a Delta Lake merge operation, and it has just happened again (the exact same task, out of dozens of tasks in our job).&lt;/P&gt;</description>
      <pubDate>Mon, 19 Jan 2026 08:13:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/intermittent-task-execution-issues/m-p/144390#M52316</guid>
      <dc:creator>Malthe</dc:creator>
      <dc:date>2026-01-19T08:13:46Z</dc:date>
    </item>
    <item>
      <title>Re: Intermittent task execution issues</title>
      <link>https://community.databricks.com/t5/data-engineering/intermittent-task-execution-issues/m-p/144605#M52350</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9268"&gt;@Malthe&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;Please check whether a custom Spark image is used in the jobs. If it is, try removing it and sticking to the default parameters.&lt;/P&gt;
&lt;P&gt;If not, I strongly recommend opening a support ticket via the Azure portal (assuming you are on Azure Databricks).&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Best regards,&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Jan 2026 15:49:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/intermittent-task-execution-issues/m-p/144605#M52350</guid>
      <dc:creator>aleksandra_ch</dc:creator>
      <dc:date>2026-01-20T15:49:49Z</dc:date>
    </item>
    <item>
      <title>Hi @Malthe,  The ISOLATION_STARTUP_FAILURE.SANDBOX_STARTU...</title>
      <link>https://community.databricks.com/t5/data-engineering/intermittent-task-execution-issues/m-p/150234#M53310</link>
      <description>Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9268"&gt;@Malthe&lt;/a&gt;,&lt;BR /&gt;&lt;BR /&gt;The ISOLATION_STARTUP_FAILURE.SANDBOX_STARTUP error you are seeing is a transient infrastructure-level issue where the serverless execution sandbox fails its internal liveness check before your code even starts running. Since the error message itself says "Please contact Databricks support" and includes SQLSTATE: XXKSS, this is recognized as an internal platform error rather than something caused by your code or configuration.&lt;BR /&gt;&lt;BR /&gt;WHAT IS HAPPENING&lt;BR /&gt;&lt;BR /&gt;On serverless compute, each task runs inside an isolated sandbox container. During startup, the platform performs a liveness check (verifying a process is listening on port 8000). When that check exceeds its deadline, the sandbox is marked as failed and the task errors out with the message you posted. Because the sandbox never fully started, the retry often hits the same transient condition, especially if the underlying infrastructure is temporarily constrained.&lt;BR /&gt;&lt;BR /&gt;RECOMMENDED ACTIONS&lt;BR /&gt;&lt;BR /&gt;1. Configure task-level retries with a delay: In your job configuration, add a retry policy to each task. Serverless jobs can auto-optimize retries, but you can also set explicit retries (e.g., 2-3 max retries) with a retry interval. The interval is calculated in milliseconds between the start of the failed run and the next retry. Adding a delay (e.g., 60000 ms) gives the platform time to recover before the next attempt. You can configure this in the job UI by clicking "+ Add" next to "Retries" in the task panel, or via the Jobs API retry_policy field.&lt;BR /&gt;&lt;BR /&gt;2. 
Open a support ticket: Since the error explicitly says "Please contact Databricks support" and includes an internal exception class, Databricks Support can correlate the timestamps with backend telemetry to determine whether this was tied to a specific deployment rollout, a regional capacity event, or another root cause. Given that you are on Azure Databricks, you can open a ticket directly through the Azure portal. Include the job run IDs, task run IDs, timestamps (with timezone), and the workspace URL so support can look up the exact sandbox that failed.&lt;BR /&gt;&lt;BR /&gt;3. Monitor the Databricks status page: You can check service health at &lt;A href="https://status.databricks.com" target="_blank"&gt;https://status.databricks.com&lt;/A&gt; (for AWS workspaces) or &lt;A href="https://status.azuredatabricks.net" target="_blank"&gt;https://status.azuredatabricks.net&lt;/A&gt; (for Azure). You can also subscribe to email, webhook, or Slack notifications for your region so you are alerted proactively when there is a platform-level incident.&lt;BR /&gt;&lt;BR /&gt;4. Review your job for timeout settings: You mentioned these failures take several minutes before they complete as failed. You can set an execution timeout using the spark.databricks.execution.timeout Spark property for serverless jobs to cap how long a task waits before being marked as timed out. Combining this with a retry policy ensures that a sandbox startup stall does not block your entire ETL pipeline for an extended period.&lt;BR /&gt;&lt;BR /&gt;REGARDING THE CUSTOM SPARK VERSION SUGGESTION&lt;BR /&gt;&lt;BR /&gt;The suggestion from &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/208844"&gt;@sandy_123&lt;/a&gt; about pinning to a custom Spark version applies to classic (non-serverless) compute where you control the Databricks Runtime version. 
On serverless compute, you cannot pin a specific runtime version because serverless is a versionless product where Databricks automatically manages the runtime. If you also experience this error on classic compute clusters, pinning to a known-good DBR version can be a valid temporary workaround while a fix is rolled out.&lt;BR /&gt;&lt;BR /&gt;REGARDING THE DELTA MERGE CONTEXT&lt;BR /&gt;&lt;BR /&gt;You noted this keeps happening on the same Delta Lake merge task. If the merge operation is particularly large or resource-intensive, the sandbox may be more susceptible to startup timeouts under load. Consider whether breaking that merge into smaller batches or optimizing the merge predicate could help reduce the resource pressure at startup time.&lt;BR /&gt;&lt;BR /&gt;SUMMARY&lt;BR /&gt;&lt;BR /&gt;- This is a transient platform-level error, not caused by your code&lt;BR /&gt;- Add task retries with an interval delay to make your ETL more resilient&lt;BR /&gt;- Open a support ticket with run IDs and timestamps for root cause analysis&lt;BR /&gt;- Subscribe to Databricks status notifications for your region&lt;BR /&gt;- On serverless, you cannot pin a custom runtime version, but retries and timeouts are your primary levers&lt;BR /&gt;&lt;BR /&gt;Documentation references:&lt;BR /&gt;- Repair and retry failed jobs: &lt;A href="https://docs.databricks.com/en/jobs/repair-job-failures.html" target="_blank"&gt;https://docs.databricks.com/en/jobs/repair-job-failures.html&lt;/A&gt;&lt;BR /&gt;- Configure tasks: &lt;A href="https://docs.databricks.com/en/jobs/configure-task.html" target="_blank"&gt;https://docs.databricks.com/en/jobs/configure-task.html&lt;/A&gt;&lt;BR /&gt;- Serverless compute overview: &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/compute/serverless/" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/compute/serverless/&lt;/A&gt;&lt;BR /&gt;- Serverless compute limitations: &lt;A 
href="https://learn.microsoft.com/en-us/azure/databricks/compute/serverless/limitations" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/compute/serverless/limitations&lt;/A&gt;&lt;BR /&gt;- Databricks status page (Azure): &lt;A href="https://status.azuredatabricks.net" target="_blank"&gt;https://status.azuredatabricks.net&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.&lt;BR /&gt;&lt;BR /&gt;If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.</description>
      <pubDate>Sun, 08 Mar 2026 18:26:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/intermittent-task-execution-issues/m-p/150234#M53310</guid>
      <dc:creator>SteveOstrowski</dc:creator>
      <dc:date>2026-03-08T18:26:45Z</dc:date>
    </item>
  </channel>
</rss>

