Connection reset error from Databricks notebook but works via curl (GCP)
04-02-2026 04:35 PM - edited 04-02-2026 04:37 PM
Hi everyone,
I’m facing a connectivity issue in my Databricks workspace on GCP and would appreciate any guidance.
Problem
When I run commands from a Databricks notebook, I see intermittent errors like:
Connection reset
Retrying request to https://us-east4.gcp.databricks.com:443
However, when I test connectivity manually from the cluster node using curl, it works fine.
What I verified
- Direct connectivity works
curl -v https://us-east4.gcp.databricks.com
- Resolves to public IP (34.x.x.x)
- TLS handshake successful
- Returns HTTP 303 → /login.html
- DNS resolution is correct
getent hosts us-east4.gcp.databricks.com
→ 34.128.x.x
- Proxy removed
- Removed HTTP_PROXY / HTTPS_PROXY environment variables
- Verified no proxy is being used
Issue inside Databricks runtime
- Notebook / Spark jobs still show:
- Connection reset
- Retry attempts in logs
Questions
- Is this expected behavior due to connection reuse / keep-alive in Databricks runtime?
- Could this be related to JVM/Spark HTTP client behavior?
- Are there recommended configurations to avoid these connection reset logs?
- When should this be considered a real failure vs harmless retry?
- Labels: Spark
04-03-2026 08:56 AM
Can someone help with this?
04-03-2026 10:36 AM - edited 04-03-2026 10:38 AM
Hi @abhishek13
This is a classic JVM HTTP client vs. system curl discrepancy, and it's very common in Databricks on GCP.
Why curl works but the notebook doesn't
Each curl invocation opens a fresh TCP connection. The Databricks runtime (and Spark internals) use persistent connection pools, typically Apache HttpClient or OkHttp, which hold connections open across requests.
GCP's load balancers and firewalls enforce idle timeouts (often 10 minutes on GCP, sometimes as low as 3–4 minutes on internal paths) and silently drop stale connections on their side. The client's connection pool doesn't know the connection is dead until it tries to reuse it, which produces the Connection reset error on the first attempt.
The retry then opens a fresh socket, which succeeds, hence the "intermittent" pattern.
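To make the reuse-then-reset pattern concrete, here is a minimal Python sketch. It is not how the Databricks runtime or any JVM HTTP client is implemented. `IdleAwareConnection` and the `connect` factory are hypothetical names, and the 540-second default simply mirrors the TTL suggested in the action plan further down.

```python
import time


class IdleAwareConnection:
    """Reuses one connection, replacing it when it is too old or gets reset.

    `connect` is a hypothetical zero-argument factory that returns an object
    with a `send(payload)` method; it stands in for a pooled HTTP connection.
    """

    def __init__(self, connect, max_idle_seconds=540.0, clock=time.monotonic):
        self._connect = connect
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._conn = None
        self._last_used = None

    def _fresh(self):
        # Open a brand-new connection and make it the current one.
        self._conn = self._connect()
        return self._conn

    def send(self, payload):
        now = self._clock()
        idle_too_long = (
            self._last_used is not None
            and now - self._last_used > self._max_idle
        )
        if self._conn is None or idle_too_long:
            # Proactively replace a connection a middlebox may have dropped.
            self._fresh()
        try:
            result = self._conn.send(payload)
        except ConnectionResetError:
            # Stale socket: retry exactly once on a new connection.
            result = self._fresh().send(payload)
        self._last_used = self._clock()
        return result
```

The `except` branch corresponds to the "Connection reset, then retry succeeds" lines in the logs, while the idle check avoids the reset altogether by never reusing a connection that has sat idle longer than the assumed timeout.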
The single strongest heuristic: if the retry log line is immediately followed by a success log and the job completes,
it is harmless. If you see the same host repeatedly failing across multiple retries without recovery,
that warrants deeper investigation (firewall rule change, DNS flap, or a Databricks control plane issue).
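The heuristic above can be sketched in Python as a check for failure streaks that never recover. This is an illustration only: the `(host, succeeded)` event format and the threshold of 3 consecutive failures are assumptions, not something Databricks emits or documents.

```python
from collections import defaultdict


def hosts_needing_investigation(events, max_consecutive_failures=3):
    """Return hosts whose most recent attempts all failed.

    `events` is an ordered iterable of (host, succeeded) pairs, one per
    request attempt. A success resets the failure streak, which matches the
    harmless "reset, retry, success" pattern. The default threshold is an
    illustrative assumption.
    """
    streak = defaultdict(int)
    for host, succeeded in events:
        streak[host] = 0 if succeeded else streak[host] + 1
    return sorted(h for h, n in streak.items() if n >= max_consecutive_failures)
```

A host that alternates between one failure and one success never appears in the result; a host that ends with three or more failures in a row does.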
Recommended action plan
- Add the TCP keepalive init script. This is the lowest-risk, highest-impact fix and addresses the root cause at the OS level.
- Set connection TTL < 540s in any HttpClient pools you control (GCP's effective idle timeout is ~600s, but be conservative).
- Monitor Connection reset as a warning, not an error. Alert only if the retry count per 5-minute window exceeds a threshold (e.g., > 10).
- If the errors persist specifically after idle periods, check whether your cluster is using serverless compute. Serverless has different network path characteristics on GCP and may need Databricks support involvement for persistent issues.
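The alerting rule in the action plan (treat Connection reset as a warning and alert only past a threshold per 5-minute window) could look roughly like this in Python. `RetryRateMonitor` is a hypothetical helper, and it assumes retry timestamps arrive in non-decreasing order; wiring it to real cluster logs is left out.

```python
from collections import deque


class RetryRateMonitor:
    """Sliding-window counter for retry events."""

    def __init__(self, window_seconds=300.0, threshold=10):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()

    def record_retry(self, timestamp):
        """Record one retry at `timestamp` (seconds); return True to alert."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop retries that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold
```

With the defaults, the first ten retries inside any 5-minute span return False, and the eleventh returns True.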
Hope this helps, @abhishek13.