yesterday
Queries executed via Databricks Connect v17 (Spark Connect / gRPC) on
serverless compute COMPLETE SUCCESSFULLY on the server side (Spark tasks
finish, results are produced), but the Spark Connect gRPC channel FAILS
TO DELIVER results back to the client application. The client receives
nothing, waits, and eventually cancels the query after its timeout.
This issue is 100% exclusive to Spark Connect. The Databricks SQL
Connector (poll-based HTTP) on the same data, same network, same user
has ZERO cancellations.
ENVIRONMENT:
------------
• databricks-connect version: 17 (latest)
• Client: External Python application via Databricks Connect
• Compute: Serverless (SERVERLESS_COMPUTE)
• Protocol: SPARK_CONNECT (gRPC / HTTP2)
EXACT FAILURE FLOW:
-------------------
1. Client app sends query via Databricks Connect (gRPC) → serverless
2. Serverless executes query → Spark tasks complete, results produced
3. *** Server FAILS to stream results back via gRPC ***
   (result_fetch_duration_ms = 0 → result delivery never starts)
4. Client waits... receives nothing... hits app timeout
5. Client cancels query/session
6. Query recorded as CANCELED in query history
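The wait-then-cancel behaviour in steps 4-5 can be sketched with a small client-side guard. This is a hypothetical helper, not part of Databricks Connect; `fetch_fn` stands in for whatever call blocks on results (e.g. `df.collect`):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def fetch_with_timeout(fetch_fn, timeout_s=60):
    """Run a blocking result fetch, giving up after timeout_s seconds.

    Helps distinguish 'server still working' from 'channel silently dead':
    a genuine error surfaces immediately, while a severed gRPC stream
    just blocks forever until this timeout fires.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_fn)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            future.cancel()
            raise RuntimeError(
                f"No results after {timeout_s}s -- matches the "
                "hung-stream pattern described above"
            )

# Placeholder usage; in the real app fetch_fn would be df.collect
print(fetch_with_timeout(lambda: "ok", timeout_s=5))
```

Wrapping the fetch like this turns an indefinite hang into a diagnosable timeout with a clear error message.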
yesterday
@subray Have you tried limiting the data to see if it works?
yesterday
Yes, I can see the query completes on the Databricks side -- results are generated but not returned.
yesterday
This is a well-known class of issue with gRPC/HTTP2 long-lived streams being killed by network intermediaries. The fact that the Databricks SQL Connector (poll-based HTTP/1.1) works perfectly while Spark Connect (gRPC/HTTP2 streaming) fails is the key diagnostic clue.
Databricks Connect uses gRPC over HTTP/2, which maintains a long-lived streaming connection. During query execution on the server, this connection appears idle from the network's perspective (no data flowing client-ward). Network devices between your client and Databricks -- corporate proxies, firewalls, load balancers, WAFs, or NAT gateways -- often have idle connection timeouts that terminate connections they consider inactive.
The failure sequence: the query executes server-side while the gRPC stream sits idle, a network intermediary silently drops the connection, and by the time the server tries to stream results the channel is already dead. This explains why result_fetch_duration_ms = 0 -- the result delivery channel was severed before streaming could begin.
Step 1: Confirm the network theory
Test from a machine with direct internet access (no corporate proxy/VPN):
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.remote(
host="https://<workspace>.cloud.databricks.com",
token="<pat>",
cluster_id="serverless"
).getOrCreate()
# Run a query that takes 30+ seconds
df = spark.sql("SELECT *, sha2(cast(id as string), 256) FROM range(10000000)")
result = df.collect()
print(f"Got {len(result)} rows")
If this works from a clean network but fails from your corporate network, the issue is confirmed as a network intermediary.
Step 2: Check for proxies
# Check if HTTP/HTTPS proxy is configured
echo $HTTP_PROXY $HTTPS_PROXY $http_proxy $https_proxy
# Check if a corporate proxy intercepts traffic
curl -v https://<workspace>.cloud.databricks.com 2>&1 | grep -i proxy
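As a cross-check from inside Python (which is closer to what your client process actually sees), the standard library can report which proxies the process would pick up from the environment. This is a generic sketch, not Databricks-specific:

```python
import urllib.request

# Proxies Python resolves from HTTP_PROXY/HTTPS_PROXY/etc. -- if an
# entry appears for "https", your Spark Connect traffic is likely being
# routed through that intermediary.
proxies = urllib.request.getproxies()
print(proxies or "no proxy configured in this environment")
```

An empty dict here while `curl -v` still shows a proxy would suggest transparent interception at the network layer rather than an environment setting.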
Step 3: Enable gRPC debug logging
export SPARK_CONNECT_LOG_LEVEL=debug
export GRPC_TRACE=all
export GRPC_VERBOSITY=DEBUG
Then run your query and look for connection reset, stream closed, or EOF errors in the logs.
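Once the debug output is captured to a file, a grep along these lines narrows it to the termination events of interest. The sample log line below is fabricated purely to make the sketch self-contained; real GRPC_TRACE output is far noisier:

```shell
# Fabricated one-line sample standing in for real GRPC_TRACE output
printf 'I0000 transport: received GOAWAY frame\n' > grpc_debug.log

# The actual filter you would run over your captured logs
grep -Ei 'goaway|rst_stream|connection reset|stream closed|eof' grpc_debug.log
```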
Force the gRPC channel to send periodic PING frames, preventing intermediaries from treating the connection as idle:
import grpc
from databricks.connect import DatabricksSession
# Configure keepalive options
spark = DatabricksSession.builder.remote(
host="https://<workspace>.cloud.databricks.com",
token="<pat>",
cluster_id="serverless"
).header("grpc-keepalive-time-ms", "10000") \
.header("grpc-keepalive-timeout-ms", "5000") \
.getOrCreate()
If custom headers don't work for keepalive, try setting environment variables before creating the session:
import os
os.environ["GRPC_KEEPALIVE_TIME_MS"] = "10000" # Send ping every 10s
os.environ["GRPC_KEEPALIVE_TIMEOUT_MS"] = "5000" # Wait 5s for pong
os.environ["GRPC_KEEPALIVE_PERMIT_WITHOUT_CALLS"] = "1" # Ping even when idle
os.environ["GRPC_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS"] = "5000"
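As far as I know, grpc-python does not read `GRPC_KEEPALIVE_*` environment variables; its documented mechanism for keepalive is channel options. If the environment-variable route has no effect, these are the underlying knobs those variables are trying to reach, shown on a plain `grpc` channel (demonstration only -- the real Spark Connect channel is TLS-secured and constructed for you by databricks-connect):

```python
# Canonical gRPC Python keepalive settings, expressed as channel options.
KEEPALIVE_OPTIONS = [
    ("grpc.keepalive_time_ms", 10_000),          # send a PING every 10s
    ("grpc.keepalive_timeout_ms", 5_000),        # wait 5s for the PING ack
    ("grpc.keepalive_permit_without_calls", 1),  # ping even with no active RPC
]

try:
    import grpc
    # Illustrative channel; swap in your real target and credentials.
    channel = grpc.insecure_channel("localhost:50051", options=KEEPALIVE_OPTIONS)
except ImportError:
    channel = None  # grpcio not installed in this environment
```

Whether these options can be injected into the channel that Databricks Connect builds depends on the client version; if the builder offers no hook, this at least confirms which settings you are trying to influence.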
If you're behind a corporate proxy, configure a proxy bypass for your Databricks workspace:
# Add to your environment
export NO_PROXY=".cloud.databricks.com,.azuredatabricks.net"
Or configure your proxy (Squid, Zscaler, etc.) to pass through HTTP/2 traffic to Databricks endpoints without terminating/re-establishing the connection.
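If you cannot change shell profiles on every client machine, the same bypass can be set from Python before the session is created (a sketch; the domain list mirrors the export above):

```python
import os

# Must run before DatabricksSession is built, since proxy settings are
# read when the connection is first established.
for var in ("NO_PROXY", "no_proxy"):  # some libraries only read one case
    os.environ[var] = ".cloud.databricks.com,.azuredatabricks.net"
```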
Large result sets take longer to stream, increasing the window for connection drops. Reduce what you pull to the client:
# Instead of collecting all rows
# df.collect() # BAD -- pulls everything via gRPC
# Option A: Limit rows
df.limit(10000).collect()
# Option B: Use toPandas with Arrow (more efficient streaming)
pdf = df.limit(10000).toPandas()
# Option C: Write results to a table, then read via SQL Connector
df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.results_temp")
# Then read with Databricks SQL Connector (HTTP-based, no gRPC issues)
Since the SQL Connector works on your network, use a hybrid approach -- Spark Connect for transformations, SQL Connector for result retrieval:
from databricks.connect import DatabricksSession
from databricks import sql
# Use Spark Connect for computation
spark = DatabricksSession.builder.remote(...).getOrCreate()
df = spark.sql("SELECT ... complex transformation ...")
df.write.mode("overwrite").saveAsTable("tmp.results")
# Use SQL Connector (HTTP) for result retrieval
with sql.connect(
server_hostname="<workspace>.cloud.databricks.com",
http_path="/sql/1.0/warehouses/<id>",
access_token="<pat>"
) as conn:
cursor = conn.cursor()
cursor.execute("SELECT * FROM tmp.results")
results = cursor.fetchall()
If you control the network infrastructure, increase the idle timeout on the device killing the connection:
| Device | Setting | Recommended Value |
|---|---|---|
| AWS ALB/NLB | Idle timeout | 300-3600 seconds |
| Azure Application Gateway | Connection idle timeout | 300+ seconds |
| Squid Proxy | connect_timeout / read_timeout | 3600 seconds |
| Zscaler | SSL inspection timeout | Bypass for Databricks |
| Corporate Firewall | TCP idle timeout | 3600 seconds |
If your network uses TLS inspection (MITM proxy), the gRPC channel may fail silently:
export GRPC_DEFAULT_SSL_ROOTS_FILE_PATH="/path/to/corporate-ca-bundle.crt"
Or add the corporate CA to Python's certificate store:
pip install certifi
cat /path/to/corporate-ca.pem >> $(python -c "import certifi; print(certifi.where())")
| Feature | Spark Connect (gRPC) | SQL Connector (HTTP) |
|---|---|---|
| Protocol | HTTP/2 long-lived stream | HTTP/1.1 request/response |
| Connection | Persistent bidirectional | Short-lived poll-based |
| During execution | Connection appears idle | No connection held open |
| Result delivery | Server pushes via stream | Client polls for results |
| Proxy compatibility | Poor (many proxies break HTTP/2) | Excellent |
The SQL Connector's poll-based model is inherently more resilient to network intermediaries because it doesn't maintain a long-lived connection that can be killed.
If none of the above solutions work, file a Databricks support ticket with the query ID and timestamps from query history, the client-side gRPC debug logs from Step 3, and the observation that result_fetch_duration_ms = 0 while the query is recorded as CANCELED.
This may be a platform-level issue that Databricks engineering needs to investigate, especially if the gRPC stream termination is happening within Databricks' own infrastructure rather than in your network.