topic Databricks Job Cluster became unreachable in Data Engineering

Databricks Job Cluster became unreachable

Sadam97 — Fri, 30 May 2025 10:10:36 GMT

We have production streaming jobs running on job clusters. We run into cluster-related errors now and then; one example is the error below.

Run failed with error message Cluster became unreachable during run Cause: Got invalid response: 404 /ERR_NGROK_3200

When the retry mechanism restarted the same task, it failed again with the error below.

Run failed with error message The Spark driver failed to start within 900 seconds (cluster: 5523-132933-88po9xvf, driver URL: https://xx.x.x.xx:6060, start date: 2025-05-30 08:05:36 UTC)

Can you help us dig into the causes?

Re: Databricks Job Cluster became unreachable

HariSankar — Fri, 30 May 2025 13:14:33 GMT

Hey, here are some possible reasons for the errors you're seeing, followed by a few suggestions that might help improve stability.

Reasons for the Errors:

1. Cluster Became Unreachable (404 /ERR_NGROK_3200)

This typically happens when the Databricks control plane can't reach the job cluster. Common causes include:

--> Network connectivity issues between the control plane and the cluster (especially if using VPC peering or PrivateLink).
--> Cluster terminated prematurely, before the control plane could fully connect.
--> Internal proxy/tunnel issues (NGROK is used by Databricks for secure communication with job clusters).
--> Spot instances getting revoked during startup or runtime.

2. Spark Driver Failed to Start within 900 Seconds

This usually points to delays or failures during cluster startup. Some likely causes:

--> Cloud resource constraints: your cloud provider might not have available capacity for the requested VM type in that region.
--> Long or failing init scripts: these can significantly delay driver setup.
--> Too many parallel jobs starting at once can lead to concurrency bottlenecks.

Recommendations:

1. Use a Persistent or Pool-Backed Cluster for Streaming Jobs:
Streaming workloads benefit from stability. Since job clusters are ephemeral, they spin up/down every run, which introduces risk. A persistent cluster avoids startup overhead and reduces the chance of driver timeouts or unreachable clusters.

2. Avoid Spot Instances for Drivers:
Spot instances are cheaper but not reliable for long-running or stateful workloads like streaming. Consider using on-demand instances for the driver and critical worker nodes to reduce interruption risks.

3. Review Init Scripts and Cluster Config:
If you're using init scripts, make sure they are optimized and not causing delays. Keep them minimal for streaming jobs, or move some setup into the notebook/task logic where possible.

4. Monitor Cluster Event Logs:
Always check the cluster event log for failed runs. It often provides clues such as node provisioning failures, container errors, or premature terminations.

5. Implement Retry Logic with Backoff:
If you're relying on retries, use exponential backoff so that repeated attempts don't hammer cluster creation within a short time window.
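
To make the backoff idea concrete, here is a minimal Python sketch. It assumes the retries are driven by your own orchestration code; `submit`, `backoff_delays`, `run_with_backoff`, and the default timings are hypothetical names and values chosen for illustration, not something Databricks provides.

```python
import random
import time


def backoff_delays(max_retries, base_seconds=30.0, cap_seconds=600.0,
                   jitter=0.0, rng=None):
    """Return the wait in seconds before each retry: base * 2**n, capped."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = min(cap_seconds, base_seconds * (2 ** attempt))
        # Optional jitter spreads out retries when many jobs fail together.
        delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


def run_with_backoff(submit, max_attempts=5, sleep=time.sleep, **backoff_kwargs):
    """Call submit() until it succeeds or max_attempts is reached."""
    delays = backoff_delays(max_attempts - 1, **backoff_kwargs)
    for attempt in range(max_attempts):
        try:
            return submit()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
```

With `base_seconds=30` and `cap_seconds=300`, the waits grow as 30, 60, 120, 240 and then stay at 300 seconds, so a second cluster launch is not requested immediately after the first one fails.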

Re: Databricks Job Cluster became unreachable

Louis_Frolio — Fri, 30 May 2025 13:19:18 GMT

Here are some considerations:

The errors experienced in your production streaming jobs—ERR_NGROK_3200 and Spark driver failed to start within 900 seconds—stem from distinct causes related to connectivity, underlying system constraints, and driver-related issues.

### Cluster Error: 404 / ERR_NGROK_3200

Ngrok Connectivity Issues:
- The ERR_NGROK_3200 occurs when the Ngrok tunnel for communication between components becomes offline or unavailable. This can happen if the upstream resource is missing, or the requested tunnel is disconnected. Logs also indicate intermittent heartbeat timeout issues and tunnel reconnections, suggesting temporary disruptions in the session.

ELK logs highlight ping failures to cluster driver instances near the time of job failures. Driver communication outages due to networking errors or Ngrok tunneling misconfiguration are key contributors.
Mitigation strategies include increasing retry durations (at least 5 minutes) and leveraging connection pools with timeouts to reduce load during Ngrok service disruptions.
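
If the retries come from the job's own retry policy, the "at least 5 minutes" suggestion could be expressed as task-level retry settings. This is a rough sketch, assuming the Jobs API task fields `max_retries`, `min_retry_interval_millis`, and `retry_on_timeout`; the helper name and its defaults are illustrative and should be checked against your workspace's API version.

```python
def task_retry_settings(min_wait_minutes=5, max_retries=3):
    """Build the retry portion of a Jobs task definition (illustrative)."""
    return {
        "max_retries": max_retries,
        # The Jobs API expresses the wait between attempts in milliseconds.
        "min_retry_interval_millis": min_wait_minutes * 60 * 1000,
        "retry_on_timeout": True,
    }


# Merge into an existing task definition; "stream_ingest" is a made-up key.
task = {"task_key": "stream_ingest", **task_retry_settings()}
```

As far as I know this interval is fixed rather than exponential, so it complements rather than replaces backoff handled in your own orchestration.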

Driver Connectivity Challenges:
- Logs reveal multiple failed attempts to ping the cluster's driver instance, indicating communication issues. This occurred even though cluster metrics appeared normal, suggesting sporadic connectivity loss likely tied to driver health or infrastructure constraints.

### Error: The Spark driver failed to start within 900 seconds

Driver Instance Overload:
- The Spark driver may fail to start within 900 seconds under scenarios of high workload or resource exhaustion. For instance, driver responsiveness issues linked to garbage collection (GC) thrashing or memory pressure were observed in several cases.

Cluster Bootstrap Issues:
- Logs from different incidents show clusters failing due to internal errors, such as unhealthy driver instances or incomplete setups during the bootstrap phase. The root causes can range from cloud provider VM failures, Spark configuration errors, to invalid initialization scripts.

Specific Case Analysis:
- Event traces indicate cases where the driver started successfully but was unresponsive due to heavy loads or communication failures. Increasing the driver node size (e.g., moving to higher instance types) has been shown to mitigate such risks.
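
A larger driver, combined with the earlier advice in this thread to keep the driver off spot capacity, could look roughly like the job cluster spec below. This is a sketch that assumes AWS (hence `aws_attributes`); the runtime version, instance types, and helper names are placeholders, not values from the original post.

```python
def job_cluster_spec(worker_type, driver_type, num_workers):
    """Build a new_cluster block with a separately sized, on-demand driver."""
    return {
        "spark_version": "15.4.x-scala2.12",  # placeholder runtime version
        "node_type_id": worker_type,
        "driver_node_type_id": driver_type,   # driver sized independently
        "num_workers": num_workers,
        "aws_attributes": {
            # The first node (the driver) is placed on an on-demand instance.
            "first_on_demand": 1,
            "availability": "SPOT_WITH_FALLBACK",
        },
    }


def driver_is_on_demand(spec):
    """Check that the spec keeps the driver off spot instances."""
    aws = spec.get("aws_attributes", {})
    return (aws.get("availability") == "ON_DEMAND"
            or aws.get("first_on_demand", 0) >= 1)


# Hypothetical instance types: memory-optimized driver, general-purpose workers.
spec = job_cluster_spec("m5.xlarge", "r5.2xlarge", num_workers=4)
```

A check like `driver_is_on_demand` can run in CI against job definitions so a spot-only driver does not slip into production unnoticed.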

### Recommendations to Address Challenges

Ngrok-Related Resolutions:
- Extend retry intervals to mitigate downtime during tunneling disruptions.
- Investigate Ngrok logs to ensure proper tunnel establishment and reduce service overloads using connection pools.

Driver and Cluster Stability:
- Increase driver instance memory or switch to larger instance types to mitigate GC and memory pressure issues.
- Check Spark configurations and initialization scripts for errors or inefficiencies.

Monitoring and Diagnostics:
- Leverage cluster health dashboards to analyze driver metrics (e.g., GC rates, load averages) and pinpoint problem areas.
- Enable detailed logging during Spark driver initialization steps for timely identification of anomalies.
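
One way to act on the monitoring advice is to pull the cluster event log for a failed run and filter it for driver-health and termination events. The sketch below assumes the events have already been fetched as dictionaries shaped like the Clusters API events response (`timestamp`, `type`, `details.reason.code`); the sample records are invented for illustration, and the set of event types to flag is a judgment call.

```python
SUSPECT_EVENT_TYPES = {
    "NODES_LOST",
    "DRIVER_UNAVAILABLE",
    "DRIVER_NOT_RESPONDING",
    "SPARK_EXCEPTION",
    "TERMINATING",
}


def suspicious_events(events):
    """Return (timestamp, type, reason_code) for events worth a closer look."""
    hits = []
    for event in sorted(events, key=lambda e: e.get("timestamp", 0)):
        if event.get("type") in SUSPECT_EVENT_TYPES:
            reason = event.get("details", {}).get("reason", {}).get("code")
            hits.append((event.get("timestamp"), event["type"], reason))
    return hits


# Invented sample events, not taken from a real cluster.
sample = [
    {"timestamp": 3, "type": "TERMINATING",
     "details": {"reason": {"code": "DRIVER_UNREACHABLE"}}},
    {"timestamp": 1, "type": "RUNNING"},
    {"timestamp": 2, "type": "DRIVER_NOT_RESPONDING", "details": {}},
]
flagged = suspicious_events(sample)
```

Sorting by timestamp makes it easier to see whether the driver stopped responding before the termination was recorded, or whether nodes were lost first.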

These errors require further investigation into networking, VM health, and Spark driver behavior, but the mitigation steps provided can help improve cluster reliability and reduce job failures.