I have a Python 3.12 pod in AKS using Databricks Connect 18.1.1 to connect to a Databricks cluster on 18.1.
Everything works well, and normally I see no issues running a series of Spark queries.
But once in a while, even with no load on our dedicated cluster, a query that normally completes in under 10 seconds never returns. The client side in AKS keeps showing it as waiting, even after 30 minutes.
It looks like the client call is hanging without detecting any problem with gRPC, the network, or anything else in between. Cluster health seems to be OK.
It's not easily reproducible. Currently I have no timeouts set.
One suggestion is to set "databricks_http_timeout_seconds", since there seems to be no default timeout: network errors are not picked up and the client call simply waits. With this timeout set, I am hoping to at least get a failure within a reasonable time so that I can retry.
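To make the "fail in a reasonable time, then retry" idea concrete, here is a minimal client-side sketch that does not depend on any Databricks setting. It runs a callable in a daemon thread, gives up after a deadline, and retries only on that timeout. The names `call_with_deadline`, `run_with_retry`, and `QueryTimeout` are hypothetical helpers I made up for illustration, and the default values are arbitrary.

```python
import threading
import time


class QueryTimeout(Exception):
    """Raised when the wrapped call does not finish within the deadline."""


def call_with_deadline(fn, timeout_s):
    """Run fn() in a daemon thread and wait at most timeout_s seconds."""
    outcome = {}

    def runner():
        try:
            outcome["value"] = fn()
        except BaseException as exc:  # hand the error back to the caller
            outcome["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        # The thread cannot be killed from here; it is abandoned and may
        # keep its connection open until the call returns or the process exits.
        raise QueryTimeout(f"no result after {timeout_s}s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


def run_with_retry(fn, timeout_s=60.0, attempts=3, backoff_s=5.0):
    """Retry fn on QueryTimeout only; any other error propagates at once."""
    for attempt in range(1, attempts + 1):
        try:
            return call_with_deadline(fn, timeout_s)
        except QueryTimeout:
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)  # linear backoff between attempts
```

In my case the call would look something like `run_with_retry(lambda: spark.sql(query).collect(), timeout_s=60)`, where `spark` is the existing session and `query` is the SQL text. This only stops the caller from waiting; it does not cancel the hung request, and retrying is only safe for queries that can be run twice.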
There were also suggestions to set gRPC keepalive, which might fix these network-specific issues (ref: https://community.databricks.com/t5/data-engineering/databricks-connect-serverless-grpc-issue/td-p/1...).
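For reference, this is roughly what I understand by gRPC keepalive: a set of standard gRPC channel arguments that make the client send periodic HTTP/2 pings so a dead connection is noticed. The sketch below only builds the option list. The helper name `keepalive_options` and the values are my own assumptions, and how these options get passed into Databricks Connect depends on the client version, so that part is not shown.

```python
def keepalive_options(ping_every_s=60, ping_timeout_s=20):
    """Build gRPC channel arguments that enable client-side keepalive pings."""
    return [
        # Send a keepalive ping after this much time without activity.
        ("grpc.keepalive_time_ms", ping_every_s * 1000),
        # Treat the connection as dead if the ping is not acknowledged in time.
        ("grpc.keepalive_timeout_ms", ping_timeout_s * 1000),
        # Allow pings even when no call is in flight.
        ("grpc.keepalive_permit_without_calls", 1),
        # Do not cap the number of pings sent without data frames.
        ("grpc.http2.max_pings_without_data", 0),
    ]
```

With plain gRPC, a list like this is passed as the `options` argument when creating a channel. One caveat I am aware of: if the client pings more often than the server allows, the server may close the connection, so the interval probably should not be too aggressive.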
Has anyone else seen this issue, and will a timeout, mainly "databricks_http_timeout_seconds", fix it? Or are there other suggestions that might help?