Re: databricks-connect serverless GRPC issue

anuj_lathi · ‎04-09-2026

This is a well-known class of issue with gRPC/HTTP2 long-lived streams being killed by network intermediaries. The fact that the Databricks SQL Connector (poll-based HTTP/1.1) works perfectly while Spark Connect (gRPC/HTTP2 streaming) fails is the key diagnostic clue.

Root Cause: Network Intermediaries Killing HTTP/2 Streams

Databricks Connect uses gRPC over HTTP/2, which keeps a long-lived streaming connection open. While the query executes on the server, no data flows back to the client, so the connection looks idle from the network's perspective. Network devices between your client and Databricks -- corporate proxies, firewalls, load balancers, WAFs, or NAT gateways -- often enforce idle timeouts and terminate connections they consider inactive.

The failure sequence:

1. Client opens gRPC stream to Databricks serverless
2. Query executes on server (takes N seconds/minutes)
3. During execution, the gRPC stream is "idle" (no response data yet)
4. Network intermediary kills the "idle" HTTP/2 connection
5. Server finishes query, tries to stream results back
6. Connection is already dead -- results have nowhere to go
7. Client never receives data, eventually times out and cancels

This explains why result_fetch_duration_ms = 0 -- the result delivery channel was severed before streaming could begin.

Diagnostic Steps

Step 1: Confirm the network theory

Test from a machine with direct internet access (no corporate proxy/VPN):

from databricks.connect import DatabricksSession

# Serverless compute is selected with serverless=True rather than a cluster ID
spark = DatabricksSession.builder.remote(
    host="https://<workspace>.cloud.databricks.com",
    token="<pat>",
    serverless=True,
).getOrCreate()

# Run a query that takes 30+ seconds
df = spark.sql("SELECT *, sha2(cast(id as string), 256) FROM range(10000000)")
result = df.collect()
print(f"Got {len(result)} rows")

If this works from a clean network but fails from your corporate network, the issue is confirmed as a network intermediary.
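To estimate how long the connection can stay quiet before something on the path drops it, one option is to run queries of increasing duration and note the first one that fails. The sketch below is only a harness: `run_query_for` is a hypothetical callable you would supply yourself (for example, wrapping a Spark Connect query sized to run for roughly that many seconds). It is not part of any Databricks API.

```python
from typing import Callable, Iterable, Optional


def first_failing_duration(
    run_query_for: Callable[[int], bool],
    durations_s: Iterable[int],
) -> Optional[int]:
    """Return the shortest tested duration (seconds) at which the query fails.

    run_query_for(seconds) is a hypothetical, user-supplied callable that runs
    a query lasting roughly `seconds` and returns True if results arrived.
    Returns None if every tested duration succeeded.
    """
    for seconds in sorted(durations_s):
        try:
            ok = run_query_for(seconds)
        except Exception:
            # Treat any client-side error (timeout, reset) as a failure.
            ok = False
        if not ok:
            return seconds
    return None
```

If, say, the 30-second run succeeds and the 120-second run fails, the idle timeout on the path is likely somewhere between those two values, which narrows down which device setting to look for.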

Step 2: Check for proxies

# Check if HTTP/HTTPS proxy is configured
echo $HTTP_PROXY $HTTPS_PROXY $http_proxy $https_proxy

# Check if a corporate proxy intercepts traffic
curl -v https://<workspace>.cloud.databricks.com 2>&1 | grep -i proxy
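The same check can be done from inside the Python process that creates the session, which is useful when the shell and the IDE or notebook kernel see different environments. A minimal sketch; the list of variable names is my assumption of what is worth checking (the `grpc_proxy` and `no_grpc_proxy` names come from gRPC's environment variable documentation).

```python
import os
from typing import Dict, Mapping, Optional

# Proxy-related variable names to look for, in both lower and upper case.
PROXY_VAR_NAMES = ("http_proxy", "https_proxy", "no_proxy", "grpc_proxy", "no_grpc_proxy")


def proxy_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the proxy-related environment variables that are set and non-empty."""
    env = os.environ if env is None else env
    found: Dict[str, str] = {}
    for name in PROXY_VAR_NAMES:
        for variant in (name, name.upper()):
            value = env.get(variant)
            if value:
                found[variant] = value
    return found
```

Calling `proxy_settings()` with no arguments inspects the current process. An empty result suggests no proxy is configured through environment variables, although a transparent proxy or firewall can still sit on the path.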

Step 3: Enable gRPC debug logging

export SPARK_CONNECT_LOG_LEVEL=debug
export GRPC_TRACE=all
export GRPC_VERBOSITY=DEBUG

Then run your query and look for connection reset, stream closed, or EOF errors in the logs.
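With `GRPC_TRACE=all` the output gets large quickly, so it can help to filter it. A rough sketch that pulls out lines mentioning the errors above; the `GOAWAY` and `RST_STREAM` patterns are my additions (they are the HTTP/2 frame types that signal a connection or stream being shut down), and the exact wording in your logs may differ.

```python
import re
from typing import List

# Case-insensitive markers of a dropped connection or stream.
DISCONNECT_PATTERN = re.compile(
    r"connection reset|stream closed|\bEOF\b|GOAWAY|RST_STREAM",
    re.IGNORECASE,
)


def disconnect_lines(log_text: str) -> List[str]:
    """Return the log lines that mention a connection or stream termination."""
    return [line for line in log_text.splitlines() if DISCONNECT_PATTERN.search(line)]
```

Redirect stderr from your script to a file, read it back, and pass the contents to `disconnect_lines`. The timestamps on the matching lines, compared with when the query started, give a direct reading of how long the connection survived.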

Solutions

Solution 1: Configure gRPC Keepalive (Most Effective)

Force the gRPC channel to send periodic PING frames, preventing intermediaries from treating the connection as idle:

from databricks.connect import DatabricksSession

# Attempt to pass keepalive settings as headers. Headers travel as per-request
# gRPC metadata, so confirm in the debug logs that PING frames are actually sent.
spark = (
    DatabricksSession.builder.remote(
        host="https://<workspace>.cloud.databricks.com",
        token="<pat>",
        serverless=True,
    )
    .header("grpc-keepalive-time-ms", "10000")
    .header("grpc-keepalive-timeout-ms", "5000")
    .getOrCreate()
)

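Custom headers are sent as gRPC request metadata rather than applied as channel arguments, so they may have no effect on keepalive. If they don't help, try setting environment variables before creating the session. Treat this as an experiment as well: gRPC documents keepalive as channel arguments (grpc.keepalive_time_ms, grpc.keepalive_timeout_ms, grpc.keepalive_permit_without_calls), and the variables below do not appear in gRPC core's list of documented environment variables, so check the Step 3 debug logs to see whether pings are being sent.

import os

os.environ["GRPC_KEEPALIVE_TIME_MS"] = "10000"  # Send ping every 10s
os.environ["GRPC_KEEPALIVE_TIMEOUT_MS"] = "5000"  # Wait 5s for pong
os.environ["GRPC_KEEPALIVE_PERMIT_WITHOUT_CALLS"] = "1"  # Ping even when idle
# The corresponding gRPC channel argument is enforced on the server side
os.environ["GRPC_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS"] = "5000"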
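For reference, this is roughly what keepalive looks like when expressed as gRPC channel arguments, which is the form gRPC itself documents. The sketch only builds the option list; how (or whether) such options can be passed into the channel that Databricks Connect creates is not something I can confirm here, so treat the wiring as an open question. The helper name and the "ping at half the idle timeout" rule are my own assumptions.

```python
from typing import List, Tuple


def keepalive_channel_options(
    idle_timeout_s: float,
    ping_timeout_ms: int = 5000,
) -> List[Tuple[str, int]]:
    """Build gRPC channel arguments that ping well inside a known idle timeout.

    idle_timeout_s is the shortest idle timeout of any device on the path.
    The ping interval is set to half of it, with a floor of one second.
    """
    if idle_timeout_s <= 0:
        raise ValueError("idle_timeout_s must be positive")
    keepalive_time_ms = max(1000, int(idle_timeout_s * 1000 / 2))
    return [
        ("grpc.keepalive_time_ms", keepalive_time_ms),      # interval between PINGs
        ("grpc.keepalive_timeout_ms", ping_timeout_ms),     # wait this long for the ack
        ("grpc.keepalive_permit_without_calls", 1),         # ping even with no active RPC
        ("grpc.http2.max_pings_without_data", 0),           # do not cap pings on a quiet stream
    ]


# With a plain gRPC client the list is passed at channel creation, e.g.:
#   grpc.secure_channel(target, credentials, options=keepalive_channel_options(60))
```

Keep in mind that a gRPC server can reject clients that ping more often than it permits (it answers with a GOAWAY), so a very short interval is not automatically better. Pinging at a fraction of the intermediary's idle timeout is usually enough.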

Solution 2: Bypass Corporate Proxy for Databricks Traffic

If you're behind a corporate proxy, configure a proxy bypass for your Databricks workspace:

# Add to your environment (gRPC documents the lowercase name, so set both)
export NO_PROXY=".cloud.databricks.com,.azuredatabricks.net"
export no_proxy="$NO_PROXY"

Or configure your proxy (Squid, Zscaler, etc.) to pass through HTTP/2 traffic to Databricks endpoints without terminating/re-establishing the connection.
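Before retesting, it is worth checking that your workspace hostname actually matches the bypass list. The sketch below uses simple domain-suffix matching, which is how such lists are commonly interpreted; individual clients differ in the details, so this is an approximation and not gRPC's exact matching logic.

```python
def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Return True if `host` matches an entry in a comma-separated NO_PROXY list.

    Approximation: an entry matches the host itself or any subdomain of it,
    and "*" matches everything.
    """
    host = host.lower().rstrip(".")
    for entry in no_proxy.split(","):
        entry = entry.strip().lower()
        if not entry:
            continue
        if entry == "*":
            return True
        suffix = entry.lstrip(".")
        if host == suffix or host.endswith("." + suffix):
            return True
    return False
```

For example, `bypasses_proxy("<workspace>.cloud.databricks.com", os.environ.get("NO_PROXY", ""))` (with `os` imported and your real hostname substituted) should return True once the variable above is set.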

Solution 3: Reduce Result Set Size

Large result sets take longer to stream, increasing the window for connection drops. Reduce what you pull to the client:

# Instead of collecting all rows
# df.collect()  # BAD -- pulls everything via gRPC

# Option A: Limit rows
df.limit(10000).collect()

# Option B: Use toPandas with Arrow (more efficient streaming)
pdf = df.limit(10000).toPandas()

# Option C: Write results to a table, then read via SQL Connector
df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.results_temp")
# Then read with Databricks SQL Connector (HTTP-based, no gRPC issues)

Solution 4: Switch to Databricks SQL Connector for Result Fetching

Since the SQL Connector works on your network, use a hybrid approach -- Spark Connect for transformations, SQL Connector for result retrieval:

from databricks.connect import DatabricksSession
from databricks import sql

# Use Spark Connect for computation
spark = DatabricksSession.builder.remote(...).getOrCreate()
df = spark.sql("SELECT ... complex transformation ...")
df.write.mode("overwrite").saveAsTable("tmp.results")

# Use SQL Connector (HTTP) for result retrieval
with sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<id>",
    access_token="<pat>",
) as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM tmp.results")
    results = cursor.fetchall()

Solution 5: Increase Timeout on Network Devices

If you control the network infrastructure, increase the idle timeout on the device killing the connection:

Device	Setting	Recommended Value
AWS ALB/NLB	Idle timeout	300-3600 seconds
Azure Application Gateway	Connection idle timeout	300+ seconds
Squid Proxy	connect_timeout / read_timeout	3600 seconds
Zscaler	SSL inspection timeout	Bypass for Databricks
Corporate Firewall	TCP idle timeout	3600 seconds

Solution 6: Use SSL Certificate Path (If TLS Issues)

If your network uses TLS inspection (MITM proxy), the gRPC channel may fail silently:

export GRPC_DEFAULT_SSL_ROOTS_FILE_PATH="/path/to/corporate-ca-bundle.crt"

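Or add the corporate CA to the certifi bundle. This helps Python libraries that rely on certifi (such as requests); the gRPC channel itself takes its roots from the path in the variable above, so set that as well:

pip install certifi
cat /path/to/corporate-ca.pem >> $(python -c "import certifi; print(certifi.where())")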

Why SQL Connector Works but Spark Connect Doesn't

Feature	Spark Connect (gRPC)	SQL Connector (HTTP)
Protocol	HTTP/2 long-lived stream	HTTP/1.1 request/response
Connection	Persistent bidirectional	Short-lived poll-based
During execution	Connection appears idle	No connection held open
Result delivery	Server pushes via stream	Client polls for results
Proxy compatibility	Poor (many proxies break HTTP/2)	Excellent

The SQL Connector's poll-based model is inherently more resilient to network intermediaries because it doesn't maintain a long-lived connection that can be killed.

When to File a Support Ticket

If none of the above solutions work, file a Databricks support ticket with:

Workspace ID and region
Query IDs of failed queries (from query history)
The result_fetch_duration_ms = 0 observation
Network topology diagram (client to Databricks path)
gRPC debug logs (SPARK_CONNECT_LOG_LEVEL=debug)
Confirmation that SQL Connector works on same network
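Support will usually also ask for client-side versions, so it can save a round trip to gather them up front. A small sketch that collects them into one dictionary; the default package names are the PyPI distribution names I would expect to be relevant, and only the names of proxy variables are recorded (not their values) to avoid leaking credentials embedded in proxy URLs.

```python
import os
import platform
import sys
from importlib import metadata
from typing import Dict, Iterable, Optional


def client_diagnostics(
    packages: Iterable[str] = ("databricks-connect", "grpcio"),
) -> Dict[str, object]:
    """Collect client-side details worth attaching to a support ticket."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    # Record which proxy variables are set, without exposing their values.
    proxy_vars = sorted(k for k in os.environ if k.lower().endswith("_proxy"))
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
        "proxy_env_vars_set": proxy_vars,
    }
```

Printing the result with `json.dumps(client_diagnostics(), indent=2)` gives a block you can paste straight into the ticket.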

This may be a platform-level issue that Databricks engineering needs to investigate, especially if the gRPC stream termination is happening within Databricks' own infrastructure rather than in your network.


Anuj Lathi
Solutions Engineer @ Databricks
