Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Intermittent task execution issues

Malthe
Valued Contributor II

We're getting intermittent errors:

[ISOLATION_STARTUP_FAILURE.SANDBOX_STARTUP] Failed to start isolated execution environment. Sandbox startup failed.
Exception class: INTERNAL.
Exception message: INTERNAL: LaunchSandboxRequest create failed - Error executing LivenessCheckStep: failed to perform livenessCommand for container [REDACTED] with commands sh -c (grep -qE ':[0]*1F40' /proc/net/tcp) || (grep -qE ':[0]*1F40' /proc/net/tcp6) || (echo "Error: No process listening on port 8000" && exit 1) and error max deadline has passed, failed to perform livenessCommand for container [REDACTED] with error , cpu.stat: NrPeriods = 0,  NrThrottled = 0, ThrottledTime = 0.
Last sandbox stdout: .
Last sandbox stderr: .
Please contact Databricks support. SQLSTATE: XXKSS

These take several minutes to "complete" (i.e. fail), and retrying seems to repeat the issue. This is just one of the ways we need to babysit our ETL jobs every now and then. We see it on serverless compute, but it can happen on other types of compute as well.

Is Databricks aware of these issues and monitoring this?

4 REPLIES

sandy_123
Databricks Partner

Hi @Malthe ,

This might be because of the new DBR (18.0) GA release yesterday (January 2026 - Azure Databricks | Microsoft Learn). You might need to use a custom Spark version until the engineering team fixes this issue in DBR. Below is the response from Databricks Support for a similar sort of problem.

"

There was a DBR release on 7th August (14.3.10 -> 14.3.11).
Our engineering team identified the issue and the fix is scheduled to be deployed on September 16th.

Until the fix is deployed, you can use the below custom Spark image version in your cluster.

Enter the below in the Custom Spark Version field (the custom image provided was the old DBR from before 8th August):
custom:release__14.3.x-snapshot-scala2.12__databricks-universe__14.3.10__9b6cd4f__debafb7__jenkins__1cbb705__format-3

"

Here's a link to instructions on how to enable a custom Spark version:

Run a custom Databricks Runtime on your cluster - Databricks

Malthe
Valued Contributor II

According to https://learn.microsoft.com/en-us/azure/databricks/release-notes/serverless/, 17.3 is the latest release for serverless and we're on Serverless Environment 4.

Here's the traceback:

File /databricks/python/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py:2433, in SparkConnectClient._handle_rpc_error(self, rpc_error)
   2429             logger.debug(f"Received ErrorInfo: {info}")
   2431             self._handle_rpc_error_with_error_info(info, status.message, status_code)  # EDGE
-> 2433             raise convert_exception(
   2434                 info,
   2435                 status.message,
   2436                 self._fetch_enriched_error(info),
   2437                 self._display_server_stack_trace(),
   2438                 status_code,
   2439             ) from None
   2441     raise SparkConnectGrpcException(
   2442         message=status.message,
   2443         sql_state=ErrorCode.CLIENT_UNEXPECTED_MISSING_SQL_STATE,  # EDGE
   2444         grpc_status_code=status_code,
   2445     ) from None
   2446 else:

It happened during a Delta Lake merge operation, and just now it happened again (the exact same task out of dozens of tasks in our job).

aleksandra_ch
Databricks Employee

Hi @Malthe ,

Please check whether a custom Spark image is used in the jobs. If it is, try removing it and sticking to the default parameters.

If not, I highly recommend opening a support ticket (assuming you are on Azure Databricks) via the Azure portal.

Best regards, 

SteveOstrowski
Databricks Employee
Hi @Malthe,

The ISOLATION_STARTUP_FAILURE.SANDBOX_STARTUP error you are seeing is a transient infrastructure-level issue where the serverless execution sandbox fails its internal liveness check before your code even starts running. Since the error message itself says "Please contact Databricks support" and includes SQLSTATE: XXKSS, this is recognized as an internal platform error rather than something caused by your code or configuration.

WHAT IS HAPPENING

On serverless compute, each task runs inside an isolated sandbox container. During startup, the platform performs a liveness check (verifying a process is listening on port 8000). When that check exceeds its deadline, the sandbox is marked as failed and the task errors out with the message you posted. Because the sandbox never fully started, the retry often hits the same transient condition, especially if the underlying infrastructure is temporarily constrained.
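As a side note, the port in the liveness command from the original error is hex-encoded: entries in /proc/net/tcp list the local address as HEX_IP:HEX_PORT, and 0x1F40 is decimal 8000, which is why the check greps for `:1F40`. A quick sketch of that decoding:

```python
# /proc/net/tcp encodes the local port as four uppercase hex digits,
# so the liveness check's grep pattern ':[0]*1F40' matches port 8000.
sandbox_port = int("1F40", 16)
print(sandbox_port)  # 8000

# The reverse conversion, matching the pattern the grep relies on:
print(format(8000, "04X"))  # 1F40
```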

RECOMMENDED ACTIONS

1. Configure task-level retries with a delay: In your job configuration, add a retry policy to each task. Serverless jobs can auto-optimize retries, but you can also set explicit retries (e.g., 2-3 max retries) with a retry interval. The interval is calculated in milliseconds between the start of the failed run and the next retry. Adding a delay (e.g., 60000 ms) gives the platform time to recover before the next attempt. You can configure this in the job UI by clicking "+ Add" next to "Retries" in the task panel, or via the Jobs API retry_policy field.
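As a sketch, the retry-related fields sit at the task level of a Jobs API payload. The task name and values below are illustrative, not a definitive configuration:

```python
# Illustrative task fragment for the Databricks Jobs API with an
# explicit retry policy; surrounding job settings are omitted.
task_settings = {
    "task_key": "delta_merge_task",      # hypothetical task name
    "max_retries": 3,                    # retry up to 3 times on failure
    "min_retry_interval_millis": 60000,  # wait at least 60s before retrying
    "retry_on_timeout": True,            # also retry when the task times out
}
print(task_settings["min_retry_interval_millis"])  # 60000
```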

2. Open a support ticket: Since the error explicitly says "Please contact Databricks support" and includes an internal exception class, Databricks Support can correlate the timestamps with backend telemetry to determine whether this was tied to a specific deployment rollout, a regional capacity event, or another root cause. Given that you are on Azure Databricks, you can open a ticket directly through the Azure portal. Include the job run IDs, task run IDs, timestamps (with timezone), and the workspace URL so support can look up the exact sandbox that failed.

3. Monitor the Databricks status page: You can check service health at https://status.databricks.com (for AWS workspaces) or https://status.azuredatabricks.net (for Azure). You can also subscribe to email, webhook, or Slack notifications for your region so you are alerted proactively when there is a platform-level incident.

4. Review your job for timeout settings: You mentioned these failures take several minutes before they complete as failed. You can set an execution timeout using the spark.databricks.execution.timeout Spark property for serverless jobs to cap how long a task waits before being marked as timed out. Combining this with a retry policy ensures that a sandbox startup stall does not block your entire ETL pipeline for an extended period.
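If you also trigger the failing operation from notebook or client code, a client-side guard can complement the job-level retry policy. A minimal sketch in plain Python; the broad exception handling and the 60-second delay are assumptions you would tune (in practice you would narrow the catch to the Spark Connect exception seen in the traceback above):

```python
import time

def run_with_retries(operation, max_attempts=3, delay_seconds=60):
    """Run `operation`, retrying on failure with a fixed delay.

    `operation` is any zero-argument callable, e.g. a function that
    performs the Delta merge. Catching bare Exception is deliberate
    in this sketch; narrow it for production use.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: surface the original error
            time.sleep(delay_seconds)  # give the platform time to recover
```

Usage would be `run_with_retries(lambda: do_merge())`, keeping the retry count aligned with the job-level policy so the two layers do not multiply into excessive attempts.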

REGARDING THE CUSTOM SPARK VERSION SUGGESTION

The suggestion from @sandy_123 about pinning to a custom Spark version applies to classic (non-serverless) compute where you control the Databricks Runtime version. On serverless compute, you cannot pin a specific runtime version because serverless is a versionless product where Databricks automatically manages the runtime. If you also experience this error on classic compute clusters, pinning to a known-good DBR version can be a valid temporary workaround while a fix is rolled out.

REGARDING THE DELTA MERGE CONTEXT

You noted this keeps happening on the same Delta Lake merge task. If the merge operation is particularly large or resource-intensive, the sandbox may be more susceptible to startup timeouts under load. Consider whether breaking that merge into smaller batches or optimizing the merge predicate could help reduce the resource pressure at startup time.
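One way to shrink each merge is to drive it in fixed-size key batches rather than a single pass. This pure-Python sketch shows only the batching logic; the merge call and the column names in the comment are hypothetical:

```python
def key_batches(keys, batch_size):
    """Split a list of merge keys into fixed-size batches."""
    for start in range(0, len(keys), batch_size):
        yield keys[start:start + batch_size]

# Hypothetical usage: run one smaller MERGE per batch of keys, e.g.
# for batch in key_batches(changed_ids, 10_000):
#     run the Delta merge restricted to this batch of ids
print(list(key_batches(list(range(10)), 4)))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```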

SUMMARY

- This is a transient platform-level error, not caused by your code
- Add task retries with an interval delay to make your ETL more resilient
- Open a support ticket with run IDs and timestamps for root cause analysis
- Subscribe to Databricks status notifications for your region
- On serverless, you cannot pin a custom runtime version, but retries and timeouts are your primary levers

Documentation references:
- Repair and retry failed jobs: https://docs.databricks.com/en/jobs/repair-job-failures.html
- Configure tasks: https://docs.databricks.com/en/jobs/configure-task.html
- Serverless compute overview: https://learn.microsoft.com/en-us/azure/databricks/compute/serverless/
- Serverless compute limitations: https://learn.microsoft.com/en-us/azure/databricks/compute/serverless/limitations
- Databricks status page (Azure): https://status.azuredatabricks.net

* This reply used an agent system I built to research and draft this response, based on the wide set of documentation I have available and previous memory. I personally review each draft for obvious issues, monitor the system for reliability, and update the response when I detect drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.