<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: DatabricksConnect from Python/AKS environment calling Databricks Cluster: Spark Query Call Hangs in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databricksconnect-from-python-aks-environment-calling-databricks/m-p/159154#M54794</link>
    <description>&lt;P&gt;&lt;SPAN class=""&gt;Query execution and result streaming generally happen over the &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;gRPC &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN class=""&gt;route. You can&amp;nbsp;&lt;SPAN&gt;force the gRPC channel to send periodic keepalive pings&amp;nbsp;so that the connection looks active to the AKS network infrastructure.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;You can set the following variables in the Python process before initializing the Databricks session (or define the same names under env in the AKS Pod manifest). Note that gRPC documents these settings as channel arguments such as grpc.keepalive_time_ms, so check that your client version actually honors them as environment variables.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import os

os.environ["GRPC_KEEPALIVE_TIME_MS"] = "30000"  # 30 seconds
os.environ["GRPC_KEEPALIVE_TIMEOUT_MS"] = "10000"  # 10 seconds
os.environ["GRPC_KEEPALIVE_PERMIT_WITHOUT_CALLS"] = "1"
os.environ["GRPC_HTTP2_MAX_PINGS_WITHOUT_DATA"] = "0"&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;Depending on the session builder you use, you may also be able to pass the equivalent settings as gRPC channel options during session creation.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;You can also check the following:&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;AKS Timeouts - &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN class=""&gt;If possible, increase the idle timeout of the&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class=""&gt;Azure NAT Gateway from its default to 15 minutes to give queries more time.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;&lt;SPAN class=""&gt;&lt;STRONG&gt;Enable gRPC Logging -&amp;nbsp;&lt;/STRONG&gt;Check the logs for connection resets, stream closures, or EOF errors.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;&lt;SPAN class=""&gt;&lt;STRONG&gt;Application-Level Timeouts -&amp;nbsp;&lt;/STRONG&gt;You can implement application-level timeouts in the code (concurrent.futures or asyncio). This lets the pipeline fail gracefully and trigger a retry mechanism rather than hanging an AKS pod indefinitely.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;Cluster Configuration -&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN class=""&gt;You can set the configurations&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class=""&gt;spark.databricks.service.server.enabled and&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class=""&gt;spark.sql.execution.arrow.pyspark.enabled to true.&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
    <pubDate>Tue, 16 Jun 2026 13:28:49 GMT</pubDate>
    <dc:creator>balajij8</dc:creator>
    <dc:date>2026-06-16T13:28:49Z</dc:date>
    <item>
      <title>DatabricksConnect from Python/AKS environment calling Databricks Cluster: Spark Query Call Hangs</title>
      <link>https://community.databricks.com/t5/data-engineering/databricksconnect-from-python-aks-environment-calling-databricks/m-p/159093#M54788</link>
      <description>&lt;P&gt;I have a Python 3.12 pod in AKS that uses Databricks Connect 18.1.1 to connect to a Databricks 18.1 cluster.&lt;/P&gt;&lt;P&gt;Everything works well, and normally I see no issues running a series of Spark queries.&lt;/P&gt;&lt;P&gt;But once in a while, even without any load on our dedicated cluster, a query that normally completes in under 10 seconds does not return, and the client side in AKS keeps showing it as waiting, even after 30 minutes.&lt;/P&gt;&lt;P&gt;It looks like the client call is hanging without recognizing a problem with gRPC, the network, or something else in between. Cluster health seems to be OK.&lt;/P&gt;&lt;P&gt;It's not easily reproducible. Currently I have no timeouts set.&lt;/P&gt;&lt;P&gt;There is a suggestion to use "&lt;SPAN&gt;databricks_http_timeout_seconds", since it seems there is no default timeout set: network errors are not picked up and the client call simply waits. If I set this timeout, I am hoping to at least get a failure within a reasonable time so that I can retry.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;There were also suggestions to set gRPC keepalive, which might fix these network-specific issues (ref:&amp;nbsp;&lt;A href="https://community.databricks.com/t5/data-engineering/databricks-connect-serverless-grpc-issue/td-p/154016" target="_blank"&gt;https://community.databricks.com/t5/data-engineering/databricks-connect-serverless-grpc-issue/td-p/154016&lt;/A&gt;).&lt;/P&gt;&lt;P&gt;Has anyone else seen this issue, and will a timeout, mainly "&lt;SPAN&gt;databricks_http_timeout_seconds", fix it? Or are there other suggestions that might help?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 15 Jun 2026 23:30:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricksconnect-from-python-aks-environment-calling-databricks/m-p/159093#M54788</guid>
      <dc:creator>JTBS</dc:creator>
      <dc:date>2026-06-15T23:30:03Z</dc:date>
    </item>
    <item>
      <title>Re: DatabricksConnect from Python/AKS environment calling Databricks Cluster: Spark Query Call Hangs</title>
      <link>https://community.databricks.com/t5/data-engineering/databricksconnect-from-python-aks-environment-calling-databricks/m-p/159154#M54794</link>
      <description>&lt;P&gt;&lt;SPAN class=""&gt;Query execution and result streaming generally happen over the &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;gRPC &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN class=""&gt;route. You can&amp;nbsp;&lt;SPAN&gt;force the gRPC channel to send periodic keepalive pings&amp;nbsp;so that the connection looks active to the AKS network infrastructure.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;You can set the following variables in the Python process before initializing the Databricks session (or define the same names under env in the AKS Pod manifest). Note that gRPC documents these settings as channel arguments such as grpc.keepalive_time_ms, so check that your client version actually honors them as environment variables.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import os

os.environ["GRPC_KEEPALIVE_TIME_MS"] = "30000"  # 30 seconds
os.environ["GRPC_KEEPALIVE_TIMEOUT_MS"] = "10000"  # 10 seconds
os.environ["GRPC_KEEPALIVE_PERMIT_WITHOUT_CALLS"] = "1"
os.environ["GRPC_HTTP2_MAX_PINGS_WITHOUT_DATA"] = "0"&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;Depending on the session builder you use, you may also be able to pass the equivalent settings as gRPC channel options during session creation.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;You can also check the following:&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;AKS Timeouts - &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN class=""&gt;If possible, increase the idle timeout of the&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class=""&gt;Azure NAT Gateway from its default to 15 minutes to give queries more time.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;&lt;SPAN class=""&gt;&lt;STRONG&gt;Enable gRPC Logging -&amp;nbsp;&lt;/STRONG&gt;Check the logs for connection resets, stream closures, or EOF errors.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;&lt;SPAN class=""&gt;&lt;STRONG&gt;Application-Level Timeouts -&amp;nbsp;&lt;/STRONG&gt;You can implement application-level timeouts in the code (concurrent.futures or asyncio). This lets the pipeline fail gracefully and trigger a retry mechanism rather than hanging an AKS pod indefinitely.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;Cluster Configuration -&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN class=""&gt;You can set the configurations&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class=""&gt;spark.databricks.service.server.enabled and&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class=""&gt;spark.sql.execution.arrow.pyspark.enabled to true.&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Tue, 16 Jun 2026 13:28:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricksconnect-from-python-aks-environment-calling-databricks/m-p/159154#M54794</guid>
      <dc:creator>balajij8</dc:creator>
      <dc:date>2026-06-16T13:28:49Z</dc:date>
    </item>
  </channel>
</rss>

