Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

databricks-connect serverless GRPC issue

subray
Visitor

Queries executed via Databricks Connect v17 (Spark Connect / gRPC) on
serverless compute COMPLETE SUCCESSFULLY on the server side (Spark tasks
finish, results are produced), but the Spark Connect gRPC channel FAILS
TO DELIVER results back to the client application. The client receives
nothing, waits, and eventually cancels the query after its timeout.

This issue is 100% exclusive to Spark Connect. The Databricks SQL
Connector (poll-based HTTP) on the same data, same network, same user
has ZERO cancellations.

ENVIRONMENT:
------------
• databricks-connect version: 17 (latest)
• Client: External Python application via Databricks Connect
• Compute: Serverless (SERVERLESS_COMPUTE)
• Protocol: SPARK_CONNECT (gRPC / HTTP2)

EXACT FAILURE FLOW:
-------------------
1. Client app sends query via Databricks Connect (gRPC) -> serverless
2. Serverless executes query -- Spark tasks complete, results produced
3. *** Server FAILS to stream results back via gRPC ***
   (result_fetch_duration_ms = 0 -- result delivery never starts)
4. Client waits... receives nothing... hits app timeout
5. Client cancels query/session
6. Query recorded as CANCELED in query history

3 REPLIES

Sumit_7
Honored Contributor II

@subray Have you tried limiting the data to see if it works?

subray
Visitor

(attached screenshot: subray_0-1775795972722.png)

Yes, I can see the query completes on the Databricks side: results are generated but not returned to the client.

anuj_lathi
Databricks Employee

This is a well-known class of issue with gRPC/HTTP2 long-lived streams being killed by network intermediaries. The fact that the Databricks SQL Connector (poll-based HTTP/1.1) works perfectly while Spark Connect (gRPC/HTTP2 streaming) fails is the key diagnostic clue.

Root Cause: Network Intermediaries Killing HTTP/2 Streams

Databricks Connect uses gRPC over HTTP/2, which maintains a long-lived streaming connection. During query execution on the server, this connection appears idle from the network's perspective (no data flowing client-ward). Network devices between your client and Databricks -- corporate proxies, firewalls, load balancers, WAFs, or NAT gateways -- often have idle connection timeouts that terminate connections they consider inactive.

The failure sequence:

  1. Client opens gRPC stream to Databricks serverless
  2. Query executes on server (takes N seconds/minutes)
  3. During execution, the gRPC stream is "idle" (no response data yet)
  4. Network intermediary kills the "idle" HTTP/2 connection
  5. Server finishes query, tries to stream results back
  6. Connection is already dead -- results have nowhere to go
  7. Client never receives data, eventually times out and cancels

 

This explains why result_fetch_duration_ms = 0 -- the result delivery channel was severed before streaming could begin.

Diagnostic Steps

Step 1: Confirm the network theory

Test from a machine with direct internet access (no corporate proxy/VPN):

from databricks.connect import DatabricksSession

# On serverless compute there is no cluster_id; the builder's
# serverless flag selects serverless instead.
spark = DatabricksSession.builder.remote(
    host="https://<workspace>.cloud.databricks.com",
    token="<pat>"
).serverless(True).getOrCreate()

# Run a query that takes 30+ seconds
df = spark.sql("SELECT *, sha2(cast(id as string), 256) FROM range(10000000)")
result = df.collect()
print(f"Got {len(result)} rows")

 

If this works from a clean network but fails from your corporate network, the issue is confirmed as a network intermediary.
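To make the comparison between the two networks concrete, a small timing wrapper (a hypothetical helper, not part of any Databricks API) reports whether the call returned, failed fast, or hung until an exception:

```python
import time

def timed_call(fn, label):
    """Run fn() and report wall-clock time and outcome.

    Wrap df.collect() with this on both the clean network and the
    corporate network; a hang that ends in a timeout/cancel only on
    the corporate path points at a network intermediary.
    """
    start = time.monotonic()
    try:
        result = fn()
        print(f"{label}: OK in {time.monotonic() - start:.1f}s")
        return result
    except Exception as exc:
        print(f"{label}: FAILED after {time.monotonic() - start:.1f}s: {exc!r}")
        raise

# e.g. timed_call(lambda: df.collect(), "corporate network")
```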

Step 2: Check for proxies

# Check if an HTTP/HTTPS proxy is configured
echo $HTTP_PROXY $HTTPS_PROXY $http_proxy $https_proxy

# Check if a corporate proxy intercepts traffic
curl -v https://<workspace>.cloud.databricks.com 2>&1 | grep -i proxy
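The same check can be made from inside Python with the standard library, which shows the proxy settings this process resolves (grpcio does its own proxy resolution from similar environment variables, so treat this as an approximation of what the gRPC channel sees):

```python
import urllib.request

# getproxies() returns the proxies Python resolves from HTTP_PROXY /
# HTTPS_PROXY (and, on some platforms, system settings).  An empty
# dict means no proxy is visible to this process.
proxies = urllib.request.getproxies()
print(proxies or "no proxy configured")
```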

 

Step 3: Enable gRPC debug logging

export SPARK_CONNECT_LOG_LEVEL=debug
export GRPC_TRACE=all
export GRPC_VERBOSITY=DEBUG

 

Then run your query and look for connection reset, stream closed, or EOF errors in the logs.
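The same logging can be enabled from inside the script, as long as the variables are set before grpcio is loaded; which tracer subset is worth keeping is a judgment call:

```python
import os

# gRPC reads these when the library loads, so set them before importing
# anything that pulls in grpcio (databricks.connect / pyspark).
os.environ["GRPC_VERBOSITY"] = "DEBUG"
os.environ["GRPC_TRACE"] = "all"   # extremely verbose; "http,tcp" is a smaller useful subset
os.environ["SPARK_CONNECT_LOG_LEVEL"] = "debug"

# from databricks.connect import DatabricksSession  # import only after this point
```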

Solutions

Solution 1: Configure gRPC Keepalive (Most Effective)

Force the gRPC channel to send periodic HTTP/2 PING frames, preventing intermediaries from treating the connection as idle. Note that keepalive is canonically a gRPC channel option rather than a request header, so whether the builder honors the headers below depends on your databricks-connect version:

from databricks.connect import DatabricksSession

# Configure keepalive options (header names here are speculative)
spark = DatabricksSession.builder.remote(
    host="https://<workspace>.cloud.databricks.com",
    token="<pat>"
).serverless(True) \
 .header("grpc-keepalive-time-ms", "10000") \
 .header("grpc-keepalive-timeout-ms", "5000") \
 .getOrCreate()

 

If custom headers don't work for keepalive, be aware that stock grpcio does not read keepalive settings from environment variables either -- keepalive is configured through gRPC channel options (grpc.keepalive_time_ms, grpc.keepalive_timeout_ms, grpc.keepalive_permit_without_calls). If your databricks-connect version does not expose channel options, fall back to the network-side fixes below.
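For reference, in plain grpcio these keepalive settings are channel options (argument tuples passed when the channel is created), not environment variables. A sketch of the option list a keepalive-enabled channel would be built with -- whether the Databricks Connect builder exposes a way to pass these through is version-dependent:

```python
# Standard grpcio keepalive channel arguments.  A raw channel would be
# built as: grpc.secure_channel(target, creds, options=KEEPALIVE_OPTIONS)
KEEPALIVE_OPTIONS = [
    ("grpc.keepalive_time_ms", 10_000),          # send an HTTP/2 PING every 10s
    ("grpc.keepalive_timeout_ms", 5_000),        # wait up to 5s for the PING ack
    ("grpc.keepalive_permit_without_calls", 1),  # ping even with no active RPC
    ("grpc.http2.max_pings_without_data", 0),    # don't cap pings on idle streams
]

opts = dict(KEEPALIVE_OPTIONS)
print(opts["grpc.keepalive_time_ms"])
```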

 

Solution 2: Bypass Corporate Proxy for Databricks Traffic

If you're behind a corporate proxy, configure a proxy bypass for your Databricks workspace:

# Add to your environment
export NO_PROXY=".cloud.databricks.com,.azuredatabricks.net"

 

Or configure your proxy (Squid, Zscaler, etc.) to pass through HTTP/2 traffic to Databricks endpoints without terminating/re-establishing the connection.
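The bypass can also be set from inside the Python process, before the session is created; gRPC's C core consults no_proxy when deciding whether to route through a proxy:

```python
import os

# Set before creating the session / importing grpcio-based clients.
# Entries beginning with a dot are suffix-matched, so these cover
# any workspace hostname.
os.environ["NO_PROXY"] = ".cloud.databricks.com,.azuredatabricks.net"
os.environ["no_proxy"] = os.environ["NO_PROXY"]  # some libraries read only the lowercase form
```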

Solution 3: Reduce Result Set Size

Large result sets take longer to stream, increasing the window for connection drops. Reduce what you pull to the client:

# Instead of collecting all rows
# df.collect()  # BAD -- pulls everything via gRPC

# Option A: Limit rows
df.limit(10000).collect()

# Option B: Use toPandas with Arrow (more efficient streaming)
pdf = df.limit(10000).toPandas()

# Option C: Write results to a table, then read via SQL Connector
df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.results_temp")
# Then read with the Databricks SQL Connector (HTTP-based, no gRPC issues)

 

Solution 4: Switch to Databricks SQL Connector for Result Fetching

Since the SQL Connector works on your network, use a hybrid approach -- Spark Connect for transformations, SQL Connector for result retrieval:

from databricks.connect import DatabricksSession
from databricks import sql

# Use Spark Connect for computation
spark = DatabricksSession.builder.remote(...).getOrCreate()
df = spark.sql("SELECT ... complex transformation ...")
df.write.mode("overwrite").saveAsTable("tmp.results")

# Use the SQL Connector (HTTP) for result retrieval
with sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<id>",
    access_token="<pat>"
) as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM tmp.results")
    results = cursor.fetchall()

 

Solution 5: Increase Timeout on Network Devices

If you control the network infrastructure, increase the idle timeout on the device killing the connection:

 

Device                     Setting                          Recommended Value
-------------------------  -------------------------------  ---------------------
AWS ALB/NLB                Idle timeout                     300-3600 seconds
Azure Application Gateway  Connection idle timeout          300+ seconds
Squid Proxy                connect_timeout / read_timeout   3600 seconds
Zscaler                    SSL inspection timeout           Bypass for Databricks
Corporate Firewall         TCP idle timeout                 3600 seconds

Solution 6: Use SSL Certificate Path (If TLS Issues)

If your network uses TLS inspection (MITM proxy), the gRPC channel may fail silently:

export GRPC_DEFAULT_SSL_ROOTS_FILE_PATH="/path/to/corporate-ca-bundle.crt"

 

Or add the corporate CA to Python's certificate store:

pip install certifi
cat /path/to/corporate-ca.pem >> $(python -c "import certifi; print(certifi.where())")
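To check where this Python build looks for trusted CAs by default (and hence whether the corporate CA actually needs appending), the standard library can report its verify paths:

```python
import ssl

# Default CA locations for this Python build.  If TLS inspection
# re-signs traffic with a corporate CA not present in any of these,
# the gRPC TLS handshake (and so Spark Connect) fails.
paths = ssl.get_default_verify_paths()
print("cafile:", paths.cafile or paths.openssl_cafile)
print("capath:", paths.capath or paths.openssl_capath)
```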

 

Why SQL Connector Works but Spark Connect Doesn't

 

Feature              Spark Connect (gRPC)              SQL Connector (HTTP)
-------------------  --------------------------------  -------------------------
Protocol             HTTP/2 long-lived stream          HTTP/1.1 request/response
Connection           Persistent bidirectional          Short-lived poll-based
During execution     Connection appears idle           No connection held open
Result delivery      Server pushes via stream          Client polls for results
Proxy compatibility  Poor (many proxies break HTTP/2)  Excellent

The SQL Connector's poll-based model is inherently more resilient to network intermediaries because it doesn't maintain a long-lived connection that can be killed.
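The difference can be sketched in a few lines; the helper names below are illustrative, not the SQL Connector's real API. Every poll is a fresh short request, so no single connection has to outlive the query:

```python
import time

def poll_until_done(get_status, fetch_results, interval_s=2.0, timeout_s=600.0):
    """Poll-based retrieval: every status check and the final fetch are
    short, independent requests, so an intermediary's idle timeout never
    has a long-lived stream to kill."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()          # short request; connection may be brand new
        if status == "FINISHED":
            return fetch_results()     # another short request
        if status in ("FAILED", "CANCELED"):
            raise RuntimeError(f"query ended in state {status}")
        time.sleep(interval_s)
    raise TimeoutError("query did not finish in time")

# Simulated run: two polls return RUNNING, then the results arrive
states = iter(["RUNNING", "RUNNING", "FINISHED"])
rows = poll_until_done(lambda: next(states), lambda: [1, 2, 3], interval_s=0.0)
print(rows)
```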

When to File a Support Ticket

If none of the above solutions work, file a Databricks support ticket with:

  • Workspace ID and region
  • Query IDs of failed queries (from query history)
  • The result_fetch_duration_ms = 0 observation
  • Network topology diagram (client to Databricks path)
  • gRPC debug logs (SPARK_CONNECT_LOG_LEVEL=debug)
  • Confirmation that the SQL Connector works on the same network

This may be a platform-level issue that Databricks engineering needs to investigate, especially if the gRPC stream termination is happening within Databricks' own infrastructure rather than in your network.


Anuj Lathi
Solutions Engineer @ Databricks