Databricks Community

JTBS · a month ago

I have a Python 3.12 pod in AKS using Databricks Connect 18.1.1 to connect to a Databricks cluster on 18.1.

Everything works well, and normally I see no issues running a series of Spark queries.

But once in a while, even without any load on our dedicated cluster, a query that normally completes in under 10 seconds does not return, and the client side in AKS keeps showing it as waiting, even after 30 minutes.

It looks like the client call is hanging without recognizing any issue with gRPC, the network, or anything else in between. Cluster health seems to be OK.

It's not easily reproducible. Currently I have no timeouts set.

There is a suggestion to use "databricks_http_timeout_seconds", since it seems no default timeout is set: network errors are not picked up and the client call simply waits. If I use this timeout, I am hoping to at least get a failure in a reasonable time so that I can retry.

There were also suggestions to set gRPC keepalive, which might fix these network-specific issues (Ref: https://community.databricks.com/t5/data-engineering/databricks-connect-serverless-grpc-issue/td-p/1...)

Can anyone say whether this issue is a known one, and whether a timeout, mainly "databricks_http_timeout_seconds", will fix it? Or are there other suggestions that might help?

balajij8 · a month ago

Execution and result streaming generally happen over the gRPC route. You can force the gRPC connection to send periodic keepalive frames so that it keeps looking active to the AKS network infrastructure.

Set the following environment variables before initializing the Databricks session, either in the AKS pod manifest or in Python as shown here.

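import os

os.environ["GRPC_KEEPALIVE_TIME_MS"] = "30000"  # send a keepalive ping every 30 seconds
os.environ["GRPC_KEEPALIVE_TIMEOUT_MS"] = "10000"  # wait 10 seconds for the ping to be acknowledged
os.environ["GRPC_KEEPALIVE_PERMIT_WITHOUT_CALLS"] = "1"  # ping even when no call is in flight
os.environ["GRPC_HTTP2_MAX_PINGS_WITHOUT_DATA"] = "0"  # no limit on pings sent without data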

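Depending on the specific builder implementation, you may also be able to pass the equivalent settings during session creation. In gRPC itself these are channel options (grpc.keepalive_time_ms and so on) rather than request headers.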
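For reference, the four keepalive settings above map onto standard gRPC channel options. The sketch below only builds that options list; the helper name keepalive_channel_options is hypothetical, and whether Databricks Connect exposes a hook for passing channel options at session creation is an assumption to verify against your version.

```python
def keepalive_channel_options(time_ms=30_000, timeout_ms=10_000):
    """Return gRPC channel options matching the keepalive values in this post.

    Hypothetical helper: the option names are standard gRPC channel
    arguments, the function itself is only an illustration.
    """
    return [
        ("grpc.keepalive_time_ms", time_ms),          # interval between pings
        ("grpc.keepalive_timeout_ms", timeout_ms),    # wait for the ping ack
        ("grpc.keepalive_permit_without_calls", 1),   # ping on idle connections
        ("grpc.http2.max_pings_without_data", 0),     # 0 = unlimited pings
    ]


options = keepalive_channel_options()
# A list of (name, value) tuples like this is the shape expected by the
# `options=` argument of grpc.secure_channel(target, credentials, options=...).
```

If the client library builds its own channel, these values only take effect when that library forwards them, so it is worth confirming with gRPC logging that pings are actually being sent.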

A few other things to check:

AKS Timeouts - If possible, increase the idle timeout of the Azure NAT Gateway from its default to 15 minutes to give queries more time.
Enable gRPC Logging - Check the logs for connection resets, stream closures, or EOF errors.
Application-Level Timeouts - Implement application-level timeouts in the code (concurrent.futures or asyncio). This ensures the pipeline fails gracefully and can trigger a retry mechanism rather than hanging an AKS pod indefinitely.
Cluster Configuration - Set spark.databricks.service.server.enabled and spark.sql.execution.arrow.pyspark.enabled to true.
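The application-level timeout and retry idea could be sketched with concurrent.futures roughly as follows. The helper names run_with_timeout and run_with_retries are hypothetical, and the timeout and backoff values are placeholders. Note that a Python thread cannot be killed: a call that is truly hung is abandoned, not cancelled, so its worker thread lingers until the process (or pod) is restarted.

```python
import concurrent.futures
import time


def run_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn in a worker thread and stop waiting after timeout_s seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args, **kwargs)
    try:
        # Raises concurrent.futures.TimeoutError if fn has not finished in time.
        return future.result(timeout=timeout_s)
    finally:
        # Do not block on a possibly hung worker; it is left running.
        executor.shutdown(wait=False)


def run_with_retries(fn, timeout_s, attempts=3, backoff_s=1.0):
    """Retry fn on timeout, sleeping a little longer before each new attempt."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return run_with_timeout(fn, timeout_s)
        except concurrent.futures.TimeoutError as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(backoff_s * attempt)
    raise TimeoutError(
        f"call did not finish within {timeout_s}s after {attempts} attempts"
    ) from last_error


# Usage sketch, assuming an existing Databricks Connect session named `spark`:
# rows = run_with_retries(lambda: spark.sql("SELECT 1").collect(), timeout_s=60)
```

Because abandoned threads can pile up if hangs are frequent, running the query in a separate process that can be terminated is a stricter alternative when that matters.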

DatabricksConnect from Python/AKS environment calling Databricks Cluster: Spark Query Call Hangs
