Hi @abhishek13
This is a classic JVM HTTP client vs. system curl discrepancy, and it's very common in Databricks on GCP.
Why curl works but the notebook doesn't
curl uses a fresh TCP connection each time. The Databricks runtime (and Spark internals) use persistent
connection pools (typically Apache HttpClient or OkHttp) that hold connections open across requests.
GCP's load balancers and firewalls enforce idle timeout policies (often 10 minutes, sometimes as low as 3-4 minutes
on internal paths) and silently drop stale connections server-side. The client's connection pool doesn't know
a connection is dead until it tries to reuse it, which produces the Connection reset error on the first attempt.
The retry then opens a fresh socket, which succeeds, hence the "intermittent" pattern.
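The reuse-then-reset pattern is easy to reproduce locally. In this minimal stdlib sketch (assumptions: a free localhost port, nothing Databricks-specific), the server closes each TCP connection right after responding, standing in for a firewall dropping idle connections; reusing the pooled client connection then fails, and a retry on a fresh socket succeeds:

```python
import http.client
import socket
import threading

# Toy server: answers one request, then closes the connection server-side,
# just like an idle-timeout drop that the client's pool never sees.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

def serve_once():
    conn, _ = server.accept()
    conn.recv(4096)
    conn.sendall(
        b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n"
        b"Connection: keep-alive\r\n\r\nok"
    )
    conn.close()  # server-side drop, invisible to the client

threading.Thread(target=serve_once, daemon=True).start()
client = http.client.HTTPConnection("127.0.0.1", port)
client.request("GET", "/")
first = client.getresponse().read()  # succeeds on the fresh connection

threading.Thread(target=serve_once, daemon=True).start()
try:
    client.request("GET", "/")  # reuses the dead socket
    client.getresponse()
except ConnectionError:  # covers ConnectionResetError / RemoteDisconnected
    client.close()              # drop the stale connection ...
    client.request("GET", "/")  # ... and retry on a new one
    second = client.getresponse().read()
```

The first attempt after the drop always fails and the immediate retry always succeeds, which is exactly the one-retry-then-success signature described above.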
The single strongest heuristic: if the retry log line is immediately followed by a success log and the job completes,
it is harmless. If you see the same host repeatedly failing across multiple retries without recovery,
that warrants deeper investigation (firewall rule change, DNS flap, or a Databricks control plane issue).
Recommended action plan
Add the TCP keepalive init script: this is the lowest-risk, highest-impact fix and addresses the root cause at the OS level.
Set connection TTL < 540s in any HttpClient pools you control (GCP's effective idle timeout is ~600s but be conservative).
Treat Connection reset as a warning, not an error, in your monitoring: alert only if the retry count per 5-minute window exceeds a threshold (e.g., > 10).
If the errors persist after idle periods specifically, check whether your cluster is using serverless compute โ serverless
has different network path characteristics on GCP and may need Databricks support involvement for persistent issues.
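The keepalive init script above works at the OS level; the same idea can be applied per-connection from client code you control. A minimal Python sketch, assuming Linux and illustrative probe values (60s/20s/3 are assumptions, not Databricks defaults):

```python
import http.client
import socket

class KeepaliveHTTPConnection(http.client.HTTPConnection):
    """Hypothetical client-side analog of the keepalive init script:
    enable TCP keepalive on each socket so the OS probes (and tears
    down) dead connections instead of failing on first reuse."""

    def connect(self):
        super().connect()
        self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        if hasattr(socket, "TCP_KEEPIDLE"):  # Linux-specific knobs
            self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
            self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 20)
            self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)
```

With these values the kernel starts probing after 60 seconds of idle, well inside even the shortest timeout windows mentioned above.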
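For the connection-TTL recommendation, pools you can't configure directly can still be fronted by a small wrapper. A stdlib sketch (the TTLConnection name and the 540s default are hypothetical, not a Databricks or HttpClient API):

```python
import http.client
import time

class TTLConnection:
    """Hypothetical wrapper: recreate the underlying connection once it
    is older than max_age, so it is retired before GCP's ~600s idle
    timeout can kill it server-side."""

    def __init__(self, host, max_age=540.0):
        self.host = host
        self.max_age = max_age
        self._conn = None
        self._born = 0.0

    def connection(self):
        age = time.monotonic() - self._born
        if self._conn is None or age > self.max_age:
            if self._conn is not None:
                self._conn.close()
            # http.client defers the actual TCP connect until first use
            self._conn = http.client.HTTPSConnection(self.host)
            self._born = time.monotonic()
        return self._conn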
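The thresholded alerting can be prototyped with a sliding-window counter; this class and its defaults (300s window, > 10 threshold, mirroring the numbers above) are illustrative, not an existing monitoring API:

```python
import time
from collections import deque

class ResetMonitor:
    """Hypothetical sliding-window counter: record each Connection reset
    as a warning and alert only when the rate exceeds the threshold."""

    def __init__(self, window=300.0, threshold=10):
        self.window = window
        self.threshold = threshold
        self._events = deque()

    def record(self, ts=None):
        """Record one reset; return True when an alert should fire."""
        ts = time.monotonic() if ts is None else ts
        self._events.append(ts)
        # Evict events that have fallen out of the window
        while self._events and self._events[0] <= ts - self.window:
            self._events.popleft()
        return len(self._events) > self.threshold
```

Wire `record()` into whatever log hook catches the reset lines; isolated resets stay silent and only a sustained burst pages anyone.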
Hope this helps, @abhishek13.
LR