Data Engineering

DatabricksConnect from Python/AKS environment calling Databricks Cluster: Spark Query Call Hangs

JTBS
New Contributor

I have a Python 3.12 pod in AKS using Databricks Connect 18.1.1 to connect to a Databricks 18.1 cluster.

Everything works well, and normally I see no issues running a series of Spark queries.

But once in a while, even with no load on our dedicated cluster, a query that normally completes in under 10 seconds does not return, and the client side in AKS keeps waiting, even after 30 minutes.

It looks like the client call is hanging without detecting a problem with gRPC, the network, or something else in between. Cluster health seems to be OK.

It's not easily reproducible. Currently I have no timeouts set.

There is a suggestion to use "databricks_http_timeout_seconds", since it seems there is no default timeout: network errors are not picked up and the client call simply waits. If I set this timeout, I am hoping to at least get a failure within a reasonable time so that I can retry.

There were also suggestions to set gRPC keepalive, which might fix these network-specific issues (Ref: https://community.databricks.com/t5/data-engineering/databricks-connect-serverless-grpc-issue/td-p/1...)

Can anyone say whether this issue has been seen before, and whether a timeout, mainly "databricks_http_timeout_seconds", will fix it? Or are there other suggestions that might help?

Accepted Solutions

balajij8
Contributor III

Query execution and result streaming generally happen over gRPC. You can make the gRPC channel send periodic keepalive frames so that the connection looks active to the AKS network infrastructure.

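Set the following environment variables before initializing the Databricks session, either in the AKS pod manifest or, as here, in Python:

import os

os.environ["GRPC_KEEPALIVE_TIME_MS"] = "30000"  # send a keepalive ping every 30 seconds
os.environ["GRPC_KEEPALIVE_TIMEOUT_MS"] = "10000"  # wait 10 seconds for the ping to be acknowledged
os.environ["GRPC_KEEPALIVE_PERMIT_WITHOUT_CALLS"] = "1"  # ping even when no call is in flight
os.environ["GRPC_HTTP2_MAX_PINGS_WITHOUT_DATA"] = "0"  # no limit on pings sent without data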

Depending on how your session builder is implemented, you may instead be able to pass the equivalent settings when creating the session.
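If the same values may also be set in the pod manifest, one way to avoid overwriting them from Python is to apply them only as defaults. The sketch below is illustrative: `KEEPALIVE_DEFAULTS` and `apply_keepalive_defaults` are hypothetical names, and it assumes, as the answer does, that the gRPC client used by Databricks Connect reads these environment variables. gRPC itself documents keepalive as channel arguments (for example `grpc.keepalive_time_ms`), so it is worth verifying that the settings actually take effect.

```python
import os

# Values copied from the answer above. Whether the client honors these
# environment variables is an assumption, not something confirmed here.
KEEPALIVE_DEFAULTS = {
    "GRPC_KEEPALIVE_TIME_MS": "30000",
    "GRPC_KEEPALIVE_TIMEOUT_MS": "10000",
    "GRPC_KEEPALIVE_PERMIT_WITHOUT_CALLS": "1",
    "GRPC_HTTP2_MAX_PINGS_WITHOUT_DATA": "0",
}


def apply_keepalive_defaults(env=os.environ):
    """Set each variable only if it is not already present in env.

    Returns the effective values so they can be logged at startup.
    """
    for name, value in KEEPALIVE_DEFAULTS.items():
        # setdefault keeps a value that the pod manifest already provided
        env.setdefault(name, value)
    return {name: env[name] for name in KEEPALIVE_DEFAULTS}
```

Call `apply_keepalive_defaults()` at the top of the entry point, before the Databricks session is created, and log the returned dictionary so the effective values are visible in the pod logs.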

Also check the following:

  • AKS Timeouts - If possible, increase the idle timeout of the Azure NAT Gateway (for example, to 15 minutes) so that a connection that stays quiet while a query runs is not dropped.
  • Enable gRPC Logging - Check the logs for connection resets, stream closures or EOF errors.
  • Application-Level Timeouts - Implement application-level timeouts in the code (concurrent.futures or asyncio). This lets the pipeline fail gracefully and trigger a retry mechanism rather than hanging an AKS pod indefinitely.
  • Cluster Configuration - Set spark.databricks.service.server.enabled and spark.sql.execution.arrow.pyspark.enabled to true in the cluster configuration.
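
The application-level timeout item can be sketched with concurrent.futures roughly as follows. This is a minimal illustration, not a Databricks Connect API: `run_with_timeout` and its parameters are hypothetical names, and the function being run stands in for whatever wraps the Spark query.

```python
import concurrent.futures
import time


def run_with_timeout(fn, timeout_s, retries=2, backoff_s=0.0):
    """Run fn() in a worker thread and stop waiting after timeout_s seconds.

    On timeout the call is retried up to `retries` more times. Any other
    exception raised by fn() propagates immediately.
    """
    last_error = None
    for attempt in range(1, retries + 2):
        executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = executor.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError as exc:
            last_error = exc
        finally:
            # wait=False: do not block on a worker that may still be hung
            executor.shutdown(wait=False)
        if attempt <= retries:
            time.sleep(backoff_s)
    raise TimeoutError(f"no result after {retries + 1} attempts") from last_error
```

In a real job, `fn` would be something like `lambda: spark.sql(query).collect()`, where `spark` and `query` are assumed to exist in the caller. Note that a timed-out worker thread is not killed: the underlying call keeps running in the background, and Python waits for such threads at interpreter exit. If the hang persists, recreating the session or restarting the pod may still be needed.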
