anuj_lathi
Databricks Employee

This is a well-known class of issue with gRPC/HTTP2 long-lived streams being killed by network intermediaries. The fact that the Databricks SQL Connector (poll-based HTTP/1.1) works perfectly while Spark Connect (gRPC/HTTP2 streaming) fails is the key diagnostic clue.

Root Cause: Network Intermediaries Killing HTTP/2 Streams

Databricks Connect uses gRPC over HTTP/2, which keeps a long-lived streaming connection open. While the query executes on the server, no data flows toward the client, so the connection looks idle from the network's perspective. Network devices between your client and Databricks -- corporate proxies, firewalls, load balancers, WAFs, or NAT gateways -- often enforce idle timeouts and terminate connections they consider inactive.

The failure sequence:

  1. Client opens gRPC stream to Databricks serverless
  2. Query executes on server (takes N seconds/minutes)
  3. During execution, the gRPC stream is "idle" (no response data yet)
  4. Network intermediary kills the "idle" HTTP/2 connection
  5. Server finishes query, tries to stream results back
  6. Connection is already dead -- results have nowhere to go
  7. Client never receives data, eventually times out and cancels

 

This explains why result_fetch_duration_ms = 0 -- the result delivery channel was severed before streaming could begin.
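The failure sequence can be reduced to a toy model: a connection survives only if its longest silent gap stays below the intermediary's idle timeout. The function name and the single-timeout simplification are illustrative assumptions, not a description of any particular device, and the model assumes the device counts keepalive PINGs as activity (some do not).

```python
from typing import Optional


def connection_survives(
    execution_s: float,
    idle_timeout_s: float,
    keepalive_interval_s: Optional[float] = None,
) -> bool:
    """Toy model: survive if the longest silent gap is shorter than the idle timeout.

    Assumption: the intermediary treats keepalive PINGs as traffic.
    """
    if keepalive_interval_s is None:
        # Without keepalive, the stream is silent for the whole execution.
        longest_gap_s = execution_s
    else:
        # With keepalive, silence never lasts longer than one ping interval.
        longest_gap_s = min(execution_s, keepalive_interval_s)
    return longest_gap_s < idle_timeout_s


# A 120 s query behind a 60 s idle timeout, without and with 10 s keepalive.
print(connection_survives(execution_s=120, idle_timeout_s=60))
print(connection_survives(execution_s=120, idle_timeout_s=60, keepalive_interval_s=10))
```

Short queries pass in this model regardless of keepalive, which matches the pattern where only long-running queries lose their results.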

Diagnostic Steps

Step 1: Confirm the network theory

Test from a machine with direct internet access (no corporate proxy/VPN):

from databricks.connect import DatabricksSession

# serverless=True targets serverless compute; cluster_id expects an actual cluster ID
spark = DatabricksSession.builder.remote(
    host="https://<workspace>.cloud.databricks.com",
    token="<pat>",
    serverless=True,
).getOrCreate()

# Run a query long enough to outlast the suspected idle timeout
# (increase the range if it finishes too quickly)
df = spark.sql("SELECT *, sha2(cast(id as string), 256) FROM range(10000000)")
result = df.collect()
print(f"Got {len(result)} rows")

 

If this works from a clean network but fails from your corporate network, the issue is confirmed as a network intermediary.
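To compare the two networks on more than pass/fail, it can help to record how long the call ran and which exception type ended it. The helper below is a hypothetical sketch using only the standard library; the stand-in workload should be swapped for the real call, for example a lambda that runs df.collect().

```python
import time
from typing import Any, Callable, Dict


def timed_run(action: Callable[[], Any]) -> Dict[str, Any]:
    """Run `action` once; record wall-clock time and the exception type if it fails."""
    started = time.monotonic()
    try:
        action()
        outcome: Dict[str, Any] = {"ok": True, "error": None}
    except Exception as exc:  # broad on purpose: we only classify the failure
        outcome = {"ok": False, "error": type(exc).__name__}
    outcome["elapsed_s"] = round(time.monotonic() - started, 2)
    return outcome


# Stand-in workload so the sketch runs anywhere.
print(timed_run(lambda: sum(range(1_000_000))))
```

If failures on the corporate network cluster around the same elapsed time, that time is a good first guess for the idle timeout of the device in the path.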

Step 2: Check for proxies

# Check if HTTP/HTTPS proxy is configured
echo $HTTP_PROXY $HTTPS_PROXY $http_proxy $https_proxy

# Check if a corporate proxy intercepts traffic
curl -v https://<workspace>.cloud.databricks.com 2>&1 | grep -i proxy
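The same check can be made from inside the Python process that creates the session, since that process may see a different environment than your shell. This is a minimal sketch; the function name is hypothetical and it only reads the conventional proxy variables.

```python
import os
from typing import Dict, Mapping

PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY", "http_proxy", "https_proxy", "no_proxy")


def proxy_settings(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return only the proxy-related variables that are set and non-empty."""
    return {name: environ[name] for name in PROXY_VARS if environ.get(name)}


settings = proxy_settings()
print(settings if settings else "No proxy variables set in this process")
```

An empty result does not rule out a transparent proxy or TLS-inspecting firewall, which intercept traffic without any client-side configuration.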

 

Step 3: Enable gRPC debug logging

export SPARK_CONNECT_LOG_LEVEL=debug
export GRPC_TRACE=all
export GRPC_VERBOSITY=DEBUG

 

Then run your query and look for connection reset, stream closed, or EOF errors in the logs.
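Debug output with GRPC_TRACE=all is very verbose, so a small filter helps. The sketch below searches for the three phrases named above; the exact wording in real logs varies by gRPC version, and the sample lines are made up for illustration rather than copied from real gRPC output.

```python
import re
from typing import Iterable, List

# Phrases taken from the text above; adjust to what your logs actually contain.
DISCONNECT_PATTERN = re.compile(r"connection reset|stream closed|\bEOF\b", re.IGNORECASE)


def find_disconnect_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention a reset, a closed stream, or EOF."""
    return [line.rstrip("\n") for line in lines if DISCONNECT_PATTERN.search(line)]


# Made-up sample lines, only to show the filter in action.
sample = [
    "D0101 chttp2 transport: sending PING\n",
    "E0101 recv failed: Connection reset by peer\n",
    "I0101 call finished\n",
]
print(find_disconnect_lines(sample))
```

For a real run, pass an open log file instead of the sample list, since a file object iterates line by line.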

Solutions

Solution 1: Configure gRPC Keepalive (Most Effective)

Force the gRPC channel to send periodic PING frames, preventing intermediaries from treating the connection as idle:

from databricks.connect import DatabricksSession

# Attach keepalive hints as request headers.
# Caveat: .header() adds request metadata; it does not set gRPC channel options,
# and these header names are not documented settings, so they may have no effect.
spark = (
    DatabricksSession.builder.remote(
        host="https://<workspace>.cloud.databricks.com",
        token="<pat>",
        serverless=True,
    )
    .header("grpc-keepalive-time-ms", "10000")
    .header("grpc-keepalive-timeout-ms", "5000")
    .getOrCreate()
)

 

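Because .header() only attaches metadata to requests, it may not change keepalive behavior. If it has no effect, try setting environment variables before creating the session, and confirm in the gRPC debug logs that PING frames are actually being sent: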

import os

os.environ["GRPC_KEEPALIVE_TIME_MS"] = "10000"           # Send ping every 10s
os.environ["GRPC_KEEPALIVE_TIMEOUT_MS"] = "5000"         # Wait 5s for pong
os.environ["GRPC_KEEPALIVE_PERMIT_WITHOUT_CALLS"] = "1"  # Ping even when idle
os.environ["GRPC_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS"] = "5000"
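For reference, gRPC itself documents keepalive as channel arguments rather than headers. The sketch below builds those arguments with the same 10 s / 5 s values. Whether Databricks Connect exposes a supported way to pass custom channel options is not confirmed here, so treat this as background for a channel you create yourself; the function name is hypothetical.

```python
from typing import List, Tuple

ChannelOption = Tuple[str, int]


def keepalive_channel_options(
    ping_every_ms: int = 10_000, pong_timeout_ms: int = 5_000
) -> List[ChannelOption]:
    """Build gRPC channel arguments that enable client-side keepalive PINGs."""
    if ping_every_ms <= 0 or pong_timeout_ms <= 0:
        raise ValueError("intervals must be positive")
    return [
        ("grpc.keepalive_time_ms", ping_every_ms),       # send a PING after this much silence
        ("grpc.keepalive_timeout_ms", pong_timeout_ms),  # wait this long for the PING ack
        ("grpc.keepalive_permit_without_calls", 1),      # ping even with no active call
        ("grpc.http2.max_pings_without_data", 0),        # no cap on pings sent without data
    ]


# With a channel you create yourself, these would be passed as:
#   grpc.secure_channel(target, credentials, options=keepalive_channel_options())
for option in keepalive_channel_options():
    print(option)
```

Very aggressive ping intervals can backfire: a gRPC server may close the connection if the client pings more often than the server's policy allows, so start conservatively.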

 

Solution 2: Bypass Corporate Proxy for Databricks Traffic

If you're behind a corporate proxy, configure a proxy bypass for your Databricks workspace:

# Add to your environment
export NO_PROXY=".cloud.databricks.com,.azuredatabricks.net"
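Before relying on the bypass, it is worth checking that your workspace hostname actually matches the NO_PROXY value. The matcher below is a simplified sketch of the common suffix-matching convention; individual clients implement NO_PROXY slightly differently, and the hostnames are hypothetical.

```python
def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Simplified suffix match of `host` against a comma-separated NO_PROXY value."""
    host = host.lower().split(":")[0]  # drop an optional port
    for entry in no_proxy.split(","):
        entry = entry.strip().lower().lstrip(".")
        if not entry:
            continue
        if entry == "*" or host == entry or host.endswith("." + entry):
            return True
    return False


no_proxy = ".cloud.databricks.com,.azuredatabricks.net"
# Hypothetical hostnames used only for illustration
print(bypasses_proxy("my-workspace.cloud.databricks.com", no_proxy))
print(bypasses_proxy("example.com", no_proxy))
```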

 

Or configure your proxy (Squid, Zscaler, etc.) to pass through HTTP/2 traffic to Databricks endpoints without terminating/re-establishing the connection.

Solution 3: Reduce Result Set Size

Large result sets take longer to stream, increasing the window for connection drops. Reduce what you pull to the client:

# Instead of collecting all rows
# df.collect()  # BAD -- pulls everything via gRPC

# Option A: Limit rows
df.limit(10000).collect()

# Option B: Use toPandas with Arrow (more efficient streaming)
pdf = df.limit(10000).toPandas()

# Option C: Write results to a table, then read via SQL Connector
df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.results_temp")
# Then read with Databricks SQL Connector (HTTP-based, no gRPC issues)

 

Solution 4: Switch to Databricks SQL Connector for Result Fetching

Since the SQL Connector works on your network, use a hybrid approach -- Spark Connect for transformations, SQL Connector for result retrieval:

from databricks import sql
from databricks.connect import DatabricksSession

# Use Spark Connect for computation
spark = DatabricksSession.builder.remote(...).getOrCreate()
df = spark.sql("SELECT ... complex transformation ...")
df.write.mode("overwrite").saveAsTable("tmp.results")

# Use SQL Connector (HTTP) for result retrieval
with sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<id>",
    access_token="<pat>",
) as conn:
    # Close the cursor automatically once the rows are fetched
    with conn.cursor() as cursor:
        cursor.execute("SELECT * FROM tmp.results")
        results = cursor.fetchall()
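For large result tables, fetchall() loads every row into memory at once. A batched read with the DB-API fetchmany call is a common alternative. The sketch below works with any DB-API 2.0 cursor; FakeCursor is a stand-in so the example runs without a warehouse, and the helper name is hypothetical.

```python
from typing import Any, Iterator, List, Sequence


def iter_batches(cursor: Any, batch_size: int = 10_000) -> Iterator[List[Sequence[Any]]]:
    """Yield lists of rows from a DB-API 2.0 cursor until it is exhausted."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            return
        yield list(rows)


class FakeCursor:
    """Stand-in for a real cursor so the sketch runs without a warehouse."""

    def __init__(self, rows):
        self._rows = list(rows)

    def fetchmany(self, size):
        batch, self._rows = self._rows[:size], self._rows[size:]
        return batch


for batch in iter_batches(FakeCursor([(i,) for i in range(5)]), batch_size=2):
    print(batch)
```

With the real connector, the cursor from conn.cursor() would be passed in place of FakeCursor after calling cursor.execute(...).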

 

Solution 5: Increase Timeout on Network Devices

If you control the network infrastructure, increase the idle timeout on the device killing the connection:

 

| Device | Setting | Recommended Value |
|---|---|---|
| AWS ALB/NLB | Idle timeout | 300-3600 seconds |
| Azure Application Gateway | Connection idle timeout | 300+ seconds |
| Squid Proxy | connect_timeout / read_timeout | 3600 seconds |
| Zscaler | SSL inspection timeout | Bypass for Databricks |
| Corporate Firewall | TCP idle timeout | 3600 seconds |

Solution 6: Use SSL Certificate Path (If TLS Issues)

If your network uses TLS inspection (MITM proxy), the gRPC channel may fail silently:

export GRPC_DEFAULT_SSL_ROOTS_FILE_PATH="/path/to/corporate-ca-bundle.crt"

 

Or append the corporate CA to the certifi bundle that requests-based Python clients use (the change is lost whenever certifi is upgraded):

pip install certifi
cat /path/to/corporate-ca.pem >> "$(python -c 'import certifi; print(certifi.where())')"
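Appending to certifi's file modifies an installed package. A less intrusive variant is to write a separate combined bundle and point GRPC_DEFAULT_SSL_ROOTS_FILE_PATH at it. The sketch below is one way to do that with the standard library; the function name, the file names, and the placeholder certificate bodies are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def build_ca_bundle(sources: Iterable[str], destination: str) -> Path:
    """Concatenate PEM files into one bundle, leaving the source files untouched."""
    out = Path(destination)
    # utf-8 because public CA bundles often carry non-ASCII comment lines
    parts = [Path(src).read_text(encoding="utf-8").strip() for src in sources]
    out.write_text("\n".join(parts) + "\n", encoding="utf-8")
    return out


# Demo with placeholder files; these are not real certificates.
with tempfile.TemporaryDirectory() as tmp:
    public = Path(tmp, "public.pem")
    corporate = Path(tmp, "corporate.pem")
    public.write_text("-----BEGIN CERTIFICATE-----\nAAAA\n-----END CERTIFICATE-----\n", encoding="utf-8")
    corporate.write_text("-----BEGIN CERTIFICATE-----\nBBBB\n-----END CERTIFICATE-----\n", encoding="utf-8")
    bundle = build_ca_bundle([str(public), str(corporate)], str(Path(tmp, "combined.pem")))
    print(bundle.read_text(encoding="utf-8").count("BEGIN CERTIFICATE"), "certificates in bundle")
```

In practice the first source would be the path printed by certifi.where() and the second your corporate CA file.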

 

Why SQL Connector Works but Spark Connect Doesn't

 

| Feature | Spark Connect (gRPC) | SQL Connector (HTTP) |
|---|---|---|
| Protocol | HTTP/2 long-lived stream | HTTP/1.1 request/response |
| Connection | Persistent bidirectional | Short-lived poll-based |
| During execution | Connection appears idle | No connection held open |
| Result delivery | Server pushes via stream | Client polls for results |
| Proxy compatibility | Poor (many proxies break HTTP/2) | Excellent |

The SQL Connector's poll-based model is inherently more resilient to network intermediaries because it doesn't maintain a long-lived connection that can be killed.
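The poll-based model can be illustrated with a generic loop in which every status check stands for one short request/response, so nothing stays open long enough to hit an idle timeout. This is a sketch of the pattern, not the SQL Connector's actual implementation; the status strings and the function name are hypothetical.

```python
import time
from typing import Callable


def poll_until_done(
    get_status: Callable[[], str],
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `get_status` repeatedly; each call stands for one short request/response."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status != "RUNNING":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("query did not finish before the deadline")
        sleep(interval_s)  # no connection is held open while waiting


# Hypothetical status source: reports RUNNING twice, then FINISHED.
statuses = iter(["RUNNING", "RUNNING", "FINISHED"])
print(poll_until_done(lambda: next(statuses), interval_s=0.01))
```

If an intermediary drops one of these short requests, the client simply issues the next poll, whereas a dropped stream loses the only channel the results could arrive on.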

When to File a Support Ticket

If none of the above solutions work, file a Databricks support ticket with:

  • Workspace ID and region
  • Query IDs of failed queries (from query history)
  • The result_fetch_duration_ms = 0 observation
  • Network topology diagram (client to Databricks path)
  • gRPC debug logs (SPARK_CONNECT_LOG_LEVEL=debug)
  • Confirmation that SQL Connector works on same network

This may be a platform-level issue that Databricks engineering needs to investigate, especially if the gRPC stream termination is happening within Databricks' own infrastructure rather than in your network.


Anuj Lathi
Solutions Engineer @ Databricks
