04-09-2026 10:03 PM
This is a well-known class of problem: long-lived gRPC/HTTP/2 streams being killed by network intermediaries. The fact that the Databricks SQL Connector (poll-based HTTP/1.1) works perfectly while Spark Connect (gRPC/HTTP/2 streaming) fails is the key diagnostic clue.
Root Cause: Network Intermediaries Killing HTTP/2 Streams
Databricks Connect uses gRPC over HTTP/2, which maintains a long-lived streaming connection. While the query executes on the server, this connection appears idle from the network's perspective because no data is flowing back to the client. Network devices between your client and Databricks -- corporate proxies, firewalls, load balancers, WAFs, or NAT gateways -- often enforce idle connection timeouts and terminate connections they consider inactive.
The failure sequence:
- Client opens gRPC stream to Databricks serverless
- Query executes on server (takes N seconds/minutes)
- During execution, the gRPC stream is "idle" (no response data yet)
- Network intermediary kills the "idle" HTTP/2 connection
- Server finishes query, tries to stream results back
- Connection is already dead -- results have nowhere to go
- Client never receives data, eventually times out and cancels
This explains why result_fetch_duration_ms = 0 -- the result delivery channel was severed before streaming could begin.
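The failure sequence above can be reasoned about with a toy model. The sketch below is not Databricks or gRPC code; the function name and the numbers are illustrative assumptions. It treats the intermediary as a timer that drops the connection once no bytes have crossed it for `idle_timeout_s` seconds.

```python
def stream_outcome(query_s: float, idle_timeout_s: float) -> str:
    """Toy model of a result stream that stays silent while the query runs."""
    # No bytes cross the intermediary until the first result arrives,
    # so the silent gap on the connection equals the query duration.
    if query_s >= idle_timeout_s:
        return f"dropped after {idle_timeout_s:g}s idle; results never delivered"
    return "results delivered"


# A short query fits inside the idle window; a long one does not.
print(stream_outcome(query_s=20, idle_timeout_s=60))
print(stream_outcome(query_s=180, idle_timeout_s=60))
```

If your failures start at a fairly consistent query duration, that duration is a reasonable first estimate of the idle timeout on the device in the path.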
Diagnostic Steps
Step 1: Confirm the network theory
Test from a machine with direct internet access (no corporate proxy/VPN):
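from databricks.connect import DatabricksSession

# If your Databricks Connect version rejects cluster_id="serverless", use its
# documented serverless option instead (for example builder.serverless(True)).
spark = DatabricksSession.builder.remote(
    host="https://<workspace>.cloud.databricks.com",
    token="<pat>",
    cluster_id="serverless"
).getOrCreate()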
# Run a query that takes 30+ seconds
df = spark.sql("SELECT *, sha2(cast(id as string), 256) FROM range(10000000)")
result = df.collect()
print(f"Got {len(result)} rows")
If this works from a clean network but fails from your corporate network, the issue is confirmed as a network intermediary.
Step 2: Check for proxies
# Check if HTTP/HTTPS proxy is configured
echo $HTTP_PROXY $HTTPS_PROXY $http_proxy $https_proxy
# Check if a corporate proxy intercepts traffic
curl -v https://<workspace>.cloud.databricks.com 2>&1 | grep -i proxy
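The same check can be done from inside the Python process that runs Databricks Connect, which is useful when an IDE or notebook sets its own environment. This is a minimal sketch; `proxy_settings` is a hypothetical helper name, not part of any Databricks library.

```python
import os
from typing import Dict, Mapping

# Variable names commonly read by HTTP clients, in both spellings.
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy",
              "NO_PROXY", "no_proxy")


def proxy_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the proxy-related variables that are set and non-empty."""
    return {name: env[name] for name in PROXY_VARS if env.get(name)}


found = proxy_settings()
if found:
    for name, value in found.items():
        print(f"{name}={value}")
else:
    print("No proxy variables set in this process")
```

An empty result only rules out an explicitly configured proxy; a transparent proxy or firewall will not show up here.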
Step 3: Enable gRPC debug logging
export SPARK_CONNECT_LOG_LEVEL=debug
export GRPC_TRACE=all
export GRPC_VERBOSITY=DEBUG
Then run your query and look for connection reset, stream closed, or EOF errors in the logs.
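Debug output from gRPC is verbose, so it can help to filter it for the phrases mentioned above. The following is a rough sketch: `find_disconnects` is a hypothetical helper and the sample lines are made up for illustration, not real gRPC log output.

```python
import re
from typing import Iterable, List, Tuple

# Phrases named in the text above; matching is case-insensitive.
PATTERN = re.compile(r"connection reset|stream closed|\bEOF\b", re.IGNORECASE)


def find_disconnects(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line that mentions a disconnect."""
    return [(number, line.rstrip())
            for number, line in enumerate(lines, start=1)
            if PATTERN.search(line)]


# Invented sample lines, only to show the shape of the result.
sample = [
    "D0409 channel created",
    "E0409 recv: Connection reset by peer",
    "I0409 Stream closed with status UNAVAILABLE",
]
for number, text in find_disconnects(sample):
    print(number, text)
```

To scan a captured log, pass an open file object instead of `sample`, since a file iterates line by line.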
Solutions
Solution 1: Configure gRPC Keepalive (Most Effective)
Force the gRPC channel to send periodic PING frames, preventing intermediaries from treating the connection as idle:
from databricks.connect import DatabricksSession

# Configure keepalive options.
# Note: .header() attaches request metadata; confirm against your Databricks
# Connect version that these keys actually influence channel keepalive.
spark = DatabricksSession.builder.remote(
    host="https://<workspace>.cloud.databricks.com",
    token="<pat>",
    cluster_id="serverless"
).header("grpc-keepalive-time-ms", "10000") \
 .header("grpc-keepalive-timeout-ms", "5000") \
 .getOrCreate()
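If custom headers have no effect on keepalive (headers travel as request metadata, which gRPC does not normally interpret as channel settings), try setting environment variables before creating the session. gRPC usually takes keepalive settings as channel arguments such as grpc.keepalive_time_ms, so verify with debug logging that your client actually honors these variables: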
import os
os.environ["GRPC_KEEPALIVE_TIME_MS"] = "10000" # Send ping every 10s
os.environ["GRPC_KEEPALIVE_TIMEOUT_MS"] = "5000" # Wait 5s for pong
os.environ["GRPC_KEEPALIVE_PERMIT_WITHOUT_CALLS"] = "1" # Ping even when idle
os.environ["GRPC_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS"] = "5000"
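For reference, in gRPC's own API keepalive is normally configured through channel arguments rather than headers. The sketch below builds such an argument list; the argument names are gRPC core channel arguments, while `keepalive_options` is a hypothetical helper. Whether Databricks Connect lets you pass channel arguments through is an assumption you would need to verify for your version.

```python
from typing import List, Tuple


def keepalive_options(idle_timeout_s: int, ping_every_s: int = 10,
                      pong_wait_s: int = 5) -> List[Tuple[str, int]]:
    """Build gRPC channel arguments that ping more often than the idle timeout."""
    if ping_every_s >= idle_timeout_s:
        raise ValueError("ping interval must be shorter than the idle timeout")
    return [
        ("grpc.keepalive_time_ms", ping_every_s * 1000),     # send a PING this often
        ("grpc.keepalive_timeout_ms", pong_wait_s * 1000),   # wait this long for the ack
        ("grpc.keepalive_permit_without_calls", 1),          # ping even with no active call
        ("grpc.http2.max_pings_without_data", 0),            # do not cap pings on a silent stream
    ]


options = keepalive_options(idle_timeout_s=60)
print(options)
```

With the grpcio package, a list in this shape is what `grpc.secure_channel(target, credentials, options=options)` accepts. Keep the ping interval conservative: a server that considers pings too frequent can close the connection.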
Solution 2: Bypass Corporate Proxy for Databricks Traffic
If you're behind a corporate proxy, configure a proxy bypass for your Databricks workspace:
# Add to your environment
export NO_PROXY=".cloud.databricks.com,.azuredatabricks.net"
Or configure your proxy (Squid, Zscaler, etc.) to pass through HTTP/2 traffic to Databricks endpoints without terminating/re-establishing the connection.
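Before relying on the NO_PROXY route, it is worth checking that the value really covers your workspace host. The matcher below is a simplified sketch: tools differ in how they treat leading dots, wildcards, and ports, so treat it as an approximation. `bypasses_proxy` and the host `myworkspace.cloud.databricks.com` are hypothetical.

```python
def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Simplified suffix match of a host against a NO_PROXY-style list."""
    host = host.lower()
    for entry in no_proxy.split(","):
        # Treat ".example.com" and "example.com" the same way.
        suffix = entry.strip().lower().lstrip(".")
        if suffix and (host == suffix or host.endswith("." + suffix)):
            return True
    return False


no_proxy = ".cloud.databricks.com,.azuredatabricks.net"
print(bypasses_proxy("myworkspace.cloud.databricks.com", no_proxy))
print(bypasses_proxy("example.org", no_proxy))
```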
Solution 3: Reduce Result Set Size
Large result sets take longer to stream, increasing the window for connection drops. Reduce what you pull to the client:
# Instead of collecting all rows
# df.collect() # BAD -- pulls everything via gRPC
# Option A: Limit rows
df.limit(10000).collect()
# Option B: Use toPandas with Arrow (more efficient streaming)
pdf = df.limit(10000).toPandas()
# Option C: Write results to a table, then read via SQL Connector
df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.results_temp")
# Then read with Databricks SQL Connector (HTTP-based, no gRPC issues)
Solution 4: Switch to Databricks SQL Connector for Result Fetching
Since the SQL Connector works on your network, use a hybrid approach -- Spark Connect for transformations, SQL Connector for result retrieval:
from databricks.connect import DatabricksSession
from databricks import sql

# Use Spark Connect for computation
spark = DatabricksSession.builder.remote(...).getOrCreate()
df = spark.sql("SELECT ... complex transformation ...")
df.write.mode("overwrite").saveAsTable("tmp.results")

# Use SQL Connector (HTTP) for result retrieval
with sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<id>",
    access_token="<pat>"
) as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM tmp.results")
    results = cursor.fetchall()
Solution 5: Increase Timeout on Network Devices
If you control the network infrastructure, increase the idle timeout on the device killing the connection:
| Device | Setting | Recommended Value |
|---|---|---|
| AWS ALB/NLB | Idle timeout | 300-3600 seconds |
| Azure Application Gateway | Connection idle timeout | 300+ seconds |
| Squid Proxy | connect_timeout / read_timeout | 3600 seconds |
| Zscaler | SSL inspection timeout | Bypass for Databricks |
| Corporate Firewall | TCP idle timeout | 3600 seconds |
Solution 6: Use SSL Certificate Path (If TLS Issues)
If your network uses TLS inspection (MITM proxy), the gRPC channel may fail silently:
export GRPC_DEFAULT_SSL_ROOTS_FILE_PATH="/path/to/corporate-ca-bundle.crt"
Or add the corporate CA to Python's certificate store:
pip install certifi
cat /path/to/corporate-ca.pem >> $(python -c "import certifi; print(certifi.where())")
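Appending to certifi's bundle modifies a file that a package upgrade can overwrite. An alternative is to write a separate combined bundle and point GRPC_DEFAULT_SSL_ROOTS_FILE_PATH at it. The sketch below assumes you already know the paths of both PEM files; `build_ca_bundle` is a hypothetical helper, and the demo uses placeholder text instead of real certificates.

```python
import tempfile
from pathlib import Path


def build_ca_bundle(base_bundle: str, corporate_ca: str, output: str) -> str:
    """Write base_bundle followed by corporate_ca to output and return output."""
    base_text = Path(base_bundle).read_text().rstrip("\n")
    corp_text = Path(corporate_ca).read_text()
    Path(output).write_text(base_text + "\n" + corp_text)
    return output


# Demo with placeholder contents; real inputs would be PEM certificate files.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp, "base.pem")
    base.write_text("BASE CERTS\n")
    corp = Path(tmp, "corporate-ca.pem")
    corp.write_text("CORPORATE CA\n")
    out = build_ca_bundle(str(base), str(corp), str(Path(tmp, "combined.pem")))
    combined_text = Path(out).read_text()

print(combined_text)
```

In real use, write the combined file to a stable location and export GRPC_DEFAULT_SSL_ROOTS_FILE_PATH with that path before the session is created.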
Why SQL Connector Works but Spark Connect Doesn't
| Feature | Spark Connect (gRPC) | SQL Connector (HTTP) |
|---|---|---|
| Protocol | HTTP/2 long-lived stream | HTTP/1.1 request/response |
| Connection | Persistent bidirectional | Short-lived poll-based |
| During execution | Connection appears idle | No connection held open |
| Result delivery | Server pushes via stream | Client polls for results |
| Proxy compatibility | Poor (many proxies break HTTP/2) | Excellent |
The SQL Connector's poll-based model is inherently more resilient to network intermediaries because it doesn't maintain a long-lived connection that can be killed.
When to File a Support Ticket
If none of the above solutions work, file a Databricks support ticket with:
- Workspace ID and region
- Query IDs of failed queries (from query history)
- The result_fetch_duration_ms = 0 observation
- Network topology diagram (client to Databricks path)
- gRPC debug logs (SPARK_CONNECT_LOG_LEVEL=debug)
- Confirmation that SQL Connector works on same network
This may be a platform-level issue that Databricks engineering needs to investigate, especially if the gRPC stream termination is happening within Databricks' own infrastructure rather than in your network.
Solutions Engineer @ Databricks