Hi @abhishek13
This is a classic JVM HTTP client vs. system curl discrepancy, and it's very common in Databricks on GCP.
Why curl works but the notebook doesn't
curl uses a fresh TCP connection each time. The Databricks runtime (and Spark internals) use persistent
connection pools (typically Apache HttpClient or OkHttp) that hold connections open across requests.
GCP's load balancers and firewalls enforce idle timeout policies (often 10 minutes, sometimes as low as 3-4 minutes
on internal paths) and silently drop stale connections server-side. The client's connection pool doesn't know
a connection is dead until it tries to reuse it, which produces the Connection reset error on the first attempt.
The retry then opens a fresh socket, which succeeds, hence the "intermittent" pattern.
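The reuse-then-reset pattern is easy to reproduce locally. In this minimal stdlib sketch (assumptions: a free localhost port, nothing Databricks-specific), the server closes each TCP connection right after responding, standing in for a firewall dropping idle connections; reusing the pooled client connection then fails, and a retry on a fresh socket succeeds:

```python
import http.client
import socket
import threading

# Toy server: answers one request, then closes the connection server-side,
# just like an idle-timeout drop that the client's pool never sees.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

def serve_once():
    conn, _ = server.accept()
    conn.recv(4096)
    conn.sendall(
        b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n"
        b"Connection: keep-alive\r\n\r\nok"
    )
    conn.close()  # server-side drop, invisible to the client

threading.Thread(target=serve_once, daemon=True).start()
client = http.client.HTTPConnection("127.0.0.1", port)
client.request("GET", "/")
first = client.getresponse().read()  # succeeds on the fresh connection

threading.Thread(target=serve_once, daemon=True).start()
try:
    client.request("GET", "/")  # reuses the dead socket
    client.getresponse()
except ConnectionError:  # covers ConnectionResetError / RemoteDisconnected
    client.close()              # drop the stale connection ...
    client.request("GET", "/")  # ... and retry on a new one
    second = client.getresponse().read()
```

The first attempt after the drop always fails and the immediate retry always succeeds, which is exactly the one-retry-then-success signature described above.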
The single strongest heuristic: if the retry log line is immediately followed by a success log and the job completes,
it is harmless. If you see the same host repeatedly failing across multiple retries without recovery,
that warrants deeper investigation (firewall rule change, DNS flap, or a Databricks control plane issue).
Recommended action plan
Add the TCP keepalive init script: this is the lowest-risk, highest-impact fix and addresses the root cause at the OS level.
Set connection TTL < 540s in any HttpClient pools you control (GCP's effective idle timeout is ~600s but be conservative).
Treat Connection reset as a warning, not an error, in your monitoring: alert only if the retry count per 5-minute window exceeds a threshold (e.g., > 10).
If the errors persist after idle periods specifically, check whether your cluster is using serverless compute โ serverless
has different network path characteristics on GCP and may need Databricks support involvement for persistent issues.
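The keepalive init script above works at the OS level; the same idea can be applied per-connection from client code you control. A minimal Python sketch, assuming Linux and illustrative probe values (60s/20s/3 are assumptions, not Databricks defaults):

```python
import http.client
import socket

class KeepaliveHTTPConnection(http.client.HTTPConnection):
    """Hypothetical client-side analog of the keepalive init script:
    enable TCP keepalive on each socket so the OS probes (and tears
    down) dead connections instead of failing on first reuse."""

    def connect(self):
        super().connect()
        self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        if hasattr(socket, "TCP_KEEPIDLE"):  # Linux-specific knobs
            self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
            self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 20)
            self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)
```

With these values the kernel starts probing after 60 seconds of idle, well inside even the shortest timeout windows mentioned above.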
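For the connection-TTL recommendation, pools you can't configure directly can still be fronted by a small wrapper. A stdlib sketch (the TTLConnection name and the 540s default are hypothetical, not a Databricks or HttpClient API):

```python
import http.client
import time

class TTLConnection:
    """Hypothetical wrapper: recreate the underlying connection once it
    is older than max_age, so it is retired before GCP's ~600s idle
    timeout can kill it server-side."""

    def __init__(self, host, max_age=540.0):
        self.host = host
        self.max_age = max_age
        self._conn = None
        self._born = 0.0

    def connection(self):
        age = time.monotonic() - self._born
        if self._conn is None or age > self.max_age:
            if self._conn is not None:
                self._conn.close()
            # http.client defers the actual TCP connect until first use
            self._conn = http.client.HTTPSConnection(self.host)
            self._born = time.monotonic()
        return self._conn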
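The thresholded alerting can be prototyped with a sliding-window counter; this class and its defaults (300s window, > 10 threshold, mirroring the numbers above) are illustrative, not an existing monitoring API:

```python
import time
from collections import deque

class ResetMonitor:
    """Hypothetical sliding-window counter: record each Connection reset
    as a warning and alert only when the rate exceeds the threshold."""

    def __init__(self, window=300.0, threshold=10):
        self.window = window
        self.threshold = threshold
        self._events = deque()

    def record(self, ts=None):
        """Record one reset; return True when an alert should fire."""
        ts = time.monotonic() if ts is None else ts
        self._events.append(ts)
        # Evict events that have fallen out of the window
        while self._events and self._events[0] <= ts - self.window:
            self._events.popleft()
        return len(self._events) > self.threshold
```

Wire `record()` into whatever log hook catches the reset lines; isolated resets stay silent and only a sustained burst pages anyone.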
Hope this helps, @abhishek13.
LR