<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic StatusCode.UNIMPLEMENTED error: DatabricksConnect library using AKS/PySpark to calling Spark cluster in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/statuscode-unimplemented-error-databricksconnect-library-using/m-p/161140#M54990</link>
    <description>&lt;P&gt;I am running a PySpark application in an AKS Python container/pod.&lt;/P&gt;&lt;P&gt;I am using the Databricks Connect 18.2.1 library with a Databricks Spark cluster on 18.2.&lt;/P&gt;&lt;P&gt;Once in a while I get the error below:&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;InactiveRpcError of RPC that terminated with: status = StatusCode.UNIMPLEMENTED details = "Received http2 header with status: 404" debug_error_string = "UNIMPLEMENTED:Received http2 header with status: 404&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I don't see any concerning cluster health issues or events, other than a few scale-up/scale-down events. I am not sure whether these events or intermittent network issues are causing open Spark sessions to lose connectivity.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;But I thought Databricks Connect 18.2.1 handled these reconnect issues better.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;I am not sure exactly what triggers it, but I am positive the library is not able to handle some scenarios. When I run all the code within the cluster in a notebook, I don't remember ever seeing this issue. So I suspect network or scale-out events, combined with library 18.2.1 not working as expected.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;I would appreciate hearing from anyone who has faced the same issue, or any insight or workarounds to get past this.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;Please note: this happens only once in a while, not always. Re-running the Spark application from AKS succeeds without errors most of the time.&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 02 Jul 2026 01:44:53 GMT</pubDate>
    <dc:creator>JTBS</dc:creator>
    <dc:date>2026-07-02T01:44:53Z</dc:date>
    <item>
      <title>StatusCode.UNIMPLEMENTED error: DatabricksConnect library using AKS/PySpark to calling Spark cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/statuscode-unimplemented-error-databricksconnect-library-using/m-p/161140#M54990</link>
      <description>&lt;P&gt;I am running a PySpark application in an AKS Python container/pod.&lt;/P&gt;&lt;P&gt;I am using the Databricks Connect 18.2.1 library with a Databricks Spark cluster on 18.2.&lt;/P&gt;&lt;P&gt;Once in a while I get the error below:&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;InactiveRpcError of RPC that terminated with: status = StatusCode.UNIMPLEMENTED details = "Received http2 header with status: 404" debug_error_string = "UNIMPLEMENTED:Received http2 header with status: 404&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I don't see any concerning cluster health issues or events, other than a few scale-up/scale-down events. I am not sure whether these events or intermittent network issues are causing open Spark sessions to lose connectivity.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;But I thought Databricks Connect 18.2.1 handled these reconnect issues better.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;I am not sure exactly what triggers it, but I am positive the library is not able to handle some scenarios. When I run all the code within the cluster in a notebook, I don't remember ever seeing this issue. So I suspect network or scale-out events, combined with library 18.2.1 not working as expected.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;I would appreciate hearing from anyone who has faced the same issue, or any insight or workarounds to get past this.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;Please note: this happens only once in a while, not always. Re-running the Spark application from AKS succeeds without errors most of the time.&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 02 Jul 2026 01:44:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/statuscode-unimplemented-error-databricksconnect-library-using/m-p/161140#M54990</guid>
      <dc:creator>JTBS</dc:creator>
      <dc:date>2026-07-02T01:44:53Z</dc:date>
    </item>
    <item>
      <title>Re: StatusCode.UNIMPLEMENTED error: DatabricksConnect library using AKS/PySpark to calling Spark clu</title>
      <link>https://community.databricks.com/t5/data-engineering/statuscode-unimplemented-error-databricksconnect-library-using/m-p/161147#M54991</link>
      <description>&lt;P&gt;&lt;SPAN&gt;This looks like a &lt;STRONG&gt;remote connection state management issue&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;that occurs when the cluster scales. &lt;/SPAN&gt;&lt;SPAN class=""&gt;StatusCode.UNIMPLEMENTED&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;with an HTTP/2 404 suggests that the Databricks Connect client is trying to reach a target, such as a specific worker node, that no longer exists after a cluster scale-down event.&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Cluster &lt;STRONG&gt;autoscaling&lt;/STRONG&gt; removes worker nodes&amp;nbsp;during scale-down events.&lt;/LI&gt;&lt;LI&gt;The client may cache &lt;STRONG&gt;stale node&lt;/STRONG&gt; references&amp;nbsp;in its connection pool.&lt;/LI&gt;&lt;LI&gt;Although the newer runtime has improved reconnection logic,&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;it may not fully handle &lt;STRONG&gt;in-flight&lt;/STRONG&gt; operations during &lt;STRONG&gt;rapid&lt;/STRONG&gt; scale events.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The steps below can reduce how often this happens:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;H3&gt;&lt;FONT size="3"&gt;&lt;STRONG&gt;&lt;U&gt;Hard Retry &amp;amp; Timeout Settings&lt;/U&gt; -&amp;nbsp;&lt;/STRONG&gt;Add the Spark configurations below to the cluster so that calls fail fast and retry. You can lower the values further after validation.&lt;/FONT&gt;&lt;/H3&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;LI-CODE lang="python"&gt;spark.databricks.io.cache.maxRetries 5
spark.databricks.io.cache.timeout 60s
spark.rpc.askTimeout 300s
spark.rpc.lookupTimeout 300s&lt;/LI-CODE&gt;&lt;UL&gt;&lt;LI&gt;&lt;H3&gt;&lt;FONT size="3"&gt;&lt;U&gt;&lt;STRONG&gt;Connection Pool Behavior&lt;/STRONG&gt;&lt;/U&gt; -&amp;nbsp;Set the Databricks Connect client configuration below in the AKS application.&lt;/FONT&gt;&lt;/H3&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;LI-CODE lang="python"&gt;# RPC retry settings
spark.conf.set("spark.rpc.retry.wait", "5s")
spark.conf.set("spark.rpc.numRetries", "5")&lt;/LI-CODE&gt;&lt;UL&gt;&lt;LI&gt;&lt;H3&gt;&lt;FONT size="3"&gt;&lt;U&gt;&lt;STRONG&gt;Application-Level Retry Logic&lt;/STRONG&gt;&lt;/U&gt; -&amp;nbsp;&lt;SPAN&gt;Wrap Spark operations in retry logic so that the Spark code can handle transient failures.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/H3&gt;&lt;/LI&gt;&lt;LI&gt;&lt;U&gt;&lt;STRONG&gt;&lt;FONT size="3"&gt;Cluster Configurations&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/U&gt;&lt;FONT size="3"&gt;&lt;FONT size="3"&gt; -&amp;nbsp;Reduce autoscaling disruption by setting longer scale-down windows, which lowers autoscaling frequency. You can also use cluster &lt;STRONG&gt;pools&lt;/STRONG&gt; to keep instances warm and reduce scale-up/down frequency.&lt;/FONT&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;LI-CODE lang="python"&gt;spark.databricks.clusterUsageTags.autoTerminationMinutes 30&lt;/LI-CODE&gt;&lt;UL&gt;&lt;LI&gt;&lt;U&gt;&lt;STRONG&gt;Disable Autoscaling&lt;/STRONG&gt;&lt;/U&gt; -&amp;nbsp;If your workload is predictable, you can use a &lt;STRONG&gt;fixed-size&lt;/STRONG&gt; cluster to eliminate scale-related disruptions.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;Alternatives&lt;/STRONG&gt;&lt;/U&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;U&gt;&lt;STRONG&gt;Databricks Lakeflow Jobs&lt;/STRONG&gt;&lt;/U&gt; - For &lt;STRONG&gt;scheduled/batch&lt;/STRONG&gt; workloads, you can trigger Databricks &lt;STRONG&gt;Jobs&lt;/STRONG&gt;&amp;nbsp;directly from &lt;STRONG&gt;AKS&lt;/STRONG&gt; instead of using Databricks Connect&lt;STRONG&gt;.&lt;/STRONG&gt; This avoids long-lived connection issues, because Jobs run natively on the cluster.&lt;/LI&gt;&lt;LI&gt;&lt;H3&gt;&lt;FONT size="3"&gt;&lt;U&gt;Serverless&lt;/U&gt; - If the workload is mostly SQL, you can use the Databricks SQL Connector instead of Databricks Connect. SQL Warehouses have better connection management. 
You can also use serverless jobs.&lt;/FONT&gt;&lt;/H3&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Thu, 02 Jul 2026 03:49:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/statuscode-unimplemented-error-databricksconnect-library-using/m-p/161147#M54991</guid>
      <dc:creator>balajij8</dc:creator>
      <dc:date>2026-07-02T03:49:43Z</dc:date>
    </item>
    <item>
      <title>Re: StatusCode.UNIMPLEMENTED error: DatabricksConnect library using AKS/PySpark to calling Spark clu</title>
      <link>https://community.databricks.com/t5/data-engineering/statuscode-unimplemented-error-databricksconnect-library-using/m-p/161241#M55006</link>
      <description>&lt;P class="p8i6j01 paragraph"&gt;&lt;STRONG&gt;Short answer:&lt;/STRONG&gt; this looks more like an intermittent &lt;STRONG&gt;Spark Connect transport/routing issue&lt;/STRONG&gt; than a Spark job logic issue. Databricks Connect uses &lt;STRONG&gt;gRPC over HTTP/2&lt;/STRONG&gt;, and the specific &lt;CODE class="p8i6j0f"&gt;InactiveRpcError ... UNIMPLEMENTED ... Received http2 header with status: 404&lt;/CODE&gt; pattern is consistent with an intermediary returning a &lt;STRONG&gt;non-gRPC HTTP 404&lt;/STRONG&gt; instead of a Spark Connect response.&lt;/P&gt;
&lt;P class="p8i6j01 paragraph"&gt;A few things stand out:&lt;/P&gt;
&lt;UL class="p8i6j07 p8i6j02"&gt;
&lt;LI class="p8i6j0a"&gt;Public release notes do &lt;STRONG&gt;not&lt;/STRONG&gt; say that &lt;CODE class="p8i6j0f"&gt;18.2.1&lt;/CODE&gt; specifically added the 404/reconnect handling you’re expecting; for Python, &lt;CODE class="p8i6j0f"&gt;18.2.1&lt;/CODE&gt; is only described as “minor fixes and internal improvements.”&lt;/LI&gt;
&lt;LI class="p8i6j0a"&gt;The explicit retry improvement for transient non-gRPC responses like &lt;STRONG&gt;HTTP 404&lt;/STRONG&gt; is called out in the &lt;CODE class="p8i6j0f"&gt;18.1.3&lt;/CODE&gt; line: the client “automatically retries transient errors that occur when an intermediary proxy returns a non-gRPC response (for example, HTTP 404…).”&lt;/LI&gt;
&lt;LI class="p8i6j0a"&gt;There is already a newer &lt;CODE class="p8i6j0f"&gt;18.2.2&lt;/CODE&gt; client, and Databricks recommends using the latest version; the runtime version must be &lt;STRONG&gt;greater than or equal to&lt;/STRONG&gt; the Connect version.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p8i6j01 paragraph"&gt;So I would not conclude “library bug only,” but I also would &lt;STRONG&gt;not&lt;/STRONG&gt; dismiss your network / scale-event theory. Similar internal examples show Spark Connect failures where the router endpoint became temporarily unavailable or the upstream returned an invalid &lt;CODE class="p8i6j0f"&gt;503&lt;/CODE&gt;, which is very much in the same family of transient transport failures rather than Spark execution failures.&lt;/P&gt;
&lt;H3 class="_9k2iva0 p8i6j0c _1ibi0s314 heading3 _9k2iva1"&gt;What I’d do&lt;/H3&gt;
&lt;OL class="p8i6j02"&gt;
&lt;LI class="p8i6j0a"&gt;&lt;STRONG&gt;Upgrade the client first&lt;/STRONG&gt; to &lt;CODE class="p8i6j0f"&gt;databricks-connect 18.2.2&lt;/CODE&gt; (or newer) and keep the cluster runtime at a compatible version.&lt;/LI&gt;
&lt;LI class="p8i6j0a"&gt;Add &lt;STRONG&gt;application-level retry with session recreation&lt;/STRONG&gt; around idempotent Spark actions. When Spark Connect sessions expire or the transport drops, the guidance is to create a new session via &lt;CODE class="p8i6j0f"&gt;DatabricksSession.builder.getOrCreate()&lt;/CODE&gt; for Databricks Connect clients.&lt;/LI&gt;
&lt;LI class="p8i6j0a"&gt;Treat this as a &lt;STRONG&gt;transient-connectivity class&lt;/STRONG&gt; error in AKS: catch &lt;CODE class="p8i6j0f"&gt;_InactiveRpcError&lt;/CODE&gt; / &lt;CODE class="p8i6j0f"&gt;UNAVAILABLE&lt;/CODE&gt; / HTTP-404-on-gRPC-path, rebuild the session, and retry the work unit if it is safe to do so.&lt;/LI&gt;
&lt;LI class="p8i6j0a"&gt;Turn on &lt;STRONG&gt;Databricks Connect Python logging&lt;/STRONG&gt; so you can correlate exact failure timestamps with cluster scale events or network events.&lt;/LI&gt;
&lt;/OL&gt;
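&lt;P class="p8i6j01 paragraph"&gt;Steps 2 and 3 can be sketched as a small helper. This is only a minimal sketch under stated assumptions, not Databricks-provided code: the function name, its parameters, and the backoff values are all hypothetical, and the work you pass in must be idempotent.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import time


def run_with_session_retry(make_session, work, retry_on,
                           max_attempts=3, base_delay_s=2.0, sleep=time.sleep):
    """Run work(session); on a retryable error, rebuild the session and retry.

    Every name here is hypothetical. make_session builds a session,
    work(session) must be safe to repeat, and retry_on is a tuple of
    exception types that should be treated as transient.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        session = make_session()  # rebuild the session on every attempt
        try:
            return work(session)
        except retry_on as exc:
            last_error = exc
            if attempt != max_attempts:
                # Exponential backoff: base, 2x base, 4x base, ...
                sleep(base_delay_s * 2 ** (attempt - 1))
    # All attempts failed with a retryable error; surface the last one as the cause.
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error&lt;/LI-CODE&gt;
&lt;P class="p8i6j01 paragraph"&gt;With Databricks Connect, &lt;CODE class="p8i6j0f"&gt;make_session&lt;/CODE&gt; could be a function that returns &lt;CODE class="p8i6j0f"&gt;DatabricksSession.builder.getOrCreate()&lt;/CODE&gt;, and &lt;CODE class="p8i6j0f"&gt;retry_on&lt;/CODE&gt; could be &lt;CODE class="p8i6j0f"&gt;(grpc.RpcError,)&lt;/CODE&gt;. Treat both as assumptions and confirm against your client version that the call really gives you a usable session after a dropped connection. Errors outside &lt;CODE class="p8i6j0f"&gt;retry_on&lt;/CODE&gt; propagate immediately, so real job logic failures are not retried.&lt;/P&gt;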
&lt;P class="p8i6j01 paragraph"&gt;The safest workaround is to structure the AKS job so each major step can be retried after:&lt;/P&gt;
&lt;UL class="p8i6j07 p8i6j02"&gt;
&lt;LI class="p8i6j0a"&gt;rebuilding the Spark session, and&lt;/LI&gt;
&lt;LI class="p8i6j0a"&gt;resuming from a checkpoint / last completed stage.&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 02 Jul 2026 17:46:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/statuscode-unimplemented-error-databricksconnect-library-using/m-p/161241#M55006</guid>
      <dc:creator>iyashk-DB</dc:creator>
      <dc:date>2026-07-02T17:46:28Z</dc:date>
    </item>
  </channel>
</rss>

