Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

StatusCode.UNIMPLEMENTED error: Databricks Connect library calling a Spark cluster from AKS/PySpark

JTBS
New Contributor II

I am running a PySpark application in an AKS Python container/pod.

I am using the Databricks Connect 18.2.1 library with a Databricks Spark cluster on runtime 18.2.

Once in a while I get the error below:

InactiveRpcError of RPC that terminated with: status = StatusCode.UNIMPLEMENTED details = "Received http2 header with status: 404" debug_error_string = "UNIMPLEMENTED:Received http2 header with status: 404

I don't see any concerning cluster health issues or events, other than a few scale-up/scale-down events. I am not sure whether these events or intermittent network issues are causing open Spark sessions to lose connectivity.

But I thought Databricks Connect 18.2.1 handled these reconnect issues better.

I am not sure exactly what triggers it, but I am positive the library is failing to handle some scenarios. When I run all the code within the cluster in a notebook, I don't remember ever seeing this issue. So I suspect network or scale-out events, combined with library 18.2.1 not working as expected.

I would appreciate hearing from anyone who has faced the same issue, or who can share insights or workarounds.

Please note: this happens only once in a while, not always. Re-running the Spark application from AKS succeeds most of the time.

 

2 REPLIES

balajij8
Contributor III

This is a remote connection state management issue that occurs when the cluster scales. StatusCode.UNIMPLEMENTED with an HTTP/2 404 indicates that the Databricks Connect client is trying to reach a target, such as a specific worker node, that no longer exists after a cluster scale-down event.

  • Cluster autoscaling removes worker nodes during scale-down events.
  • The client may cache stale node references in its connection pool.
  • While the new runtime has improved reconnection logic, it may not fully handle operations that are in flight during rapid scale events.

You can take the steps below to reduce these issues:

  • Hard Retry & Timeout Settings - Add the Spark configurations below to the cluster so that it fails fast and retries. You can reduce the values further after validation.

spark.databricks.io.cache.maxRetries 5
spark.databricks.io.cache.timeout 60s
spark.rpc.askTimeout 300s
spark.rpc.lookupTimeout 300s
  • Connection Pool Behavior - Set the Databricks Connect client configuration below in the AKS application.

# RPC timeouts
spark.conf.set("spark.rpc.retry.wait", "5s")
spark.conf.set("spark.rpc.numRetries", "5")
  • Application-Level Retry Logic - Wrap the Spark operations in retry logic to handle transient failures in the Spark code.

  • Cluster Configurations - Reduce autoscaling disruption by setting longer scale-down windows, which lowers the autoscaling frequency. You can also use cluster pools to keep instances warm and reduce scale-up/down frequency.
spark.databricks.clusterUsageTags.autoTerminationMinutes 30
  • Disable Autoscaling - Use a fixed-size cluster if your workload is predictable, to eliminate scale-related disruptions.
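
For the application-level retry item above, here is a minimal sketch of a retry wrapper with exponential backoff and jitter. The function name `retry_with_backoff` and all of its parameters are hypothetical and not part of Databricks Connect. It assumes the wrapped operation is idempotent, and the default of retrying on any `Exception` is deliberately broad, so narrow `retry_on` in real use.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_attempts: int = 4,
    base_delay_s: float = 2.0,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on the given exception types with backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # The delay doubles on each attempt up to a cap; jitter keeps
            # several pods from retrying in lockstep.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")


# Hypothetical usage, where `spark` is an existing Databricks Connect session
# and "my_table" is a placeholder table name:
# row_count = retry_with_backoff(lambda: spark.table("my_table").count())
```

The `sleep` parameter exists only so the waiting can be stubbed out in tests.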

Alternatives

  • Databricks Lakeflow Jobs - For scheduled/batch workloads, you can trigger Databricks Jobs directly from AKS instead of using Databricks Connect. This eliminates long-lived connection issues entirely, because Jobs run natively on the cluster with full resilience.
  • Serverless - If the workload is mostly SQL, you can use the Databricks SQL Connector instead of Databricks Connect. SQL Warehouses have better connection management. You can use serverless jobs too.

iyashk-DB
Databricks Employee

Short answer: this looks more like an intermittent Spark Connect transport/routing issue than a Spark job logic issue. Databricks Connect uses gRPC over HTTP/2, and the specific InactiveRpcError ... UNIMPLEMENTED ... Received http2 header with status: 404 pattern is consistent with an intermediary returning a non-gRPC HTTP 404 instead of a Spark Connect response.

A few things stand out:

  • Public release notes do not say that 18.2.1 specifically added the 404/reconnect handling you're expecting; for Python, 18.2.1 is only described as "minor fixes and internal improvements."
  • The explicit retry improvement for transient non-gRPC responses like HTTP 404 is called out in the 18.1.3 line: the client "automatically retries transient errors that occur when an intermediary proxy returns a non-gRPC response (for example, HTTP 404...)."
  • There is already a newer 18.2.2 client, and Databricks recommends using the latest version; the runtime version must be greater than or equal to the Connect version.
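
The version rule in the last bullet can be checked at startup with a few lines. This is only a sketch: the helper names are hypothetical, and it assumes the comparison is on major.minor only, so that a client patch release such as 18.2.1 counts as compatible with an 18.2 runtime. Please verify that assumption against the Databricks Connect documentation.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Parse '18.2.1' or '18.2' into (18, 2); later parts are ignored."""
    parts = version.strip().split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least MAJOR.MINOR, got {version!r}")
    return int(parts[0]), int(parts[1])


def client_is_compatible(client_version: str, runtime_version: str) -> bool:
    """True when the runtime's major.minor is >= the client's major.minor."""
    return major_minor(runtime_version) >= major_minor(client_version)


# The installed client version can be read with the standard library:
# from importlib.metadata import version
# client_version = version("databricks-connect")
```

Logging the result of this check when the pod starts makes it easier to spot a client/runtime mismatch after an image rebuild.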

So I would not conclude "library bug only," but I also would not dismiss your network / scale-event theory. Similar internal examples show Spark Connect failures where the router endpoint became temporarily unavailable or upstream returned invalid 503, which is very much in the same family of transient transport failures rather than Spark execution failures.

What I'd do

  1. Upgrade the client first to databricks-connect 18.2.2 (or newer) and keep the cluster runtime at a compatible version.
  2. Add application-level retry with session recreation around idempotent Spark actions. When Spark Connect sessions expire or the transport drops, the guidance is to create a new session via DatabricksSession.builder.getOrCreate() for Databricks Connect clients.
  3. Treat this as a transient-connectivity class error in AKS: catch _InactiveRpcError / UNAVAILABLE / HTTP-404-on-gRPC-path, rebuild the session, and retry the work unit if it is safe to do so.
  4. Turn on Databricks Connect Python logging so you can correlate exact failure timestamps with cluster scale events or network events.
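
Steps 2 and 3 could be sketched roughly as below. The names `is_transient`, `status_name`, `new_databricks_session`, and `run_with_session_rebuild` are hypothetical helpers, not Databricks Connect APIs. The classifier uses duck typing (a `code()` method returning an object with a `name`) so it does not depend on how the client wraps the gRPC error, and it treats UNIMPLEMENTED as transient only when the HTTP 404 text from this thread is present. Whether `getOrCreate()` returns a fresh session or a cached one may depend on the client version; that is not verified here.

```python
import time
from typing import Any, Callable

# Text taken from the error reported in this thread.
HTTP_404_MARKER = "Received http2 header with status: 404"


def status_name(exc: BaseException) -> str:
    """Return the gRPC status name (e.g. 'UNAVAILABLE') if exc has code()."""
    code = getattr(exc, "code", None)
    if callable(code):
        try:
            return getattr(code(), "name", "") or ""
        except Exception:
            return ""
    return ""


def is_transient(exc: BaseException) -> bool:
    """Assumption: HTTP 404 on the gRPC path and UNAVAILABLE are retryable."""
    text = str(exc)
    if HTTP_404_MARKER in text:
        return True
    return status_name(exc) == "UNAVAILABLE" or "StatusCode.UNAVAILABLE" in text


def new_databricks_session() -> Any:
    """Default factory; imported lazily so this module loads without the client."""
    from databricks.connect import DatabricksSession
    return DatabricksSession.builder.getOrCreate()


def run_with_session_rebuild(
    work: Callable[[Any], Any],
    make_session: Callable[[], Any] = new_databricks_session,
    max_attempts: int = 3,
    wait_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Run work(session); on a transient error, get a new session and retry.

    Only safe when `work` is idempotent.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        session = make_session()
        try:
            return work(session)
        except Exception as exc:
            if attempt == max_attempts or not is_transient(exc):
                raise
            sleep(wait_s * attempt)  # linear backoff before rebuilding
    raise AssertionError("unreachable")


# Hypothetical usage ("my_table" is a placeholder):
# count = run_with_session_rebuild(lambda s: s.table("my_table").count())
```

Non-transient errors (for example, a bad column name) are re-raised immediately, so real job logic bugs are not hidden behind retries.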

The safest workaround is to structure the AKS job so each major step can be retried after:

  • rebuilding the Spark session, and
  • resuming from a checkpoint / last completed stage.
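
One way to sketch the "resume from last completed stage" idea is a small progress file that records which named steps have finished. Everything here is hypothetical (`run_steps`, `load_completed`, the JSON file layout), and it assumes the file lives on storage that survives a pod restart, such as a persistent volume, and that each step is safe to re-run if it failed partway through.

```python
import json
from pathlib import Path
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]


def load_completed(path: Path) -> List[str]:
    """Return the names of steps already recorded as done."""
    if not path.exists():
        return []
    return json.loads(path.read_text())


def run_steps(steps: Sequence[Step], checkpoint: Path) -> List[str]:
    """Run steps in order, skipping recorded ones; return names run this time."""
    done = load_completed(checkpoint)
    ran: List[str] = []
    for name, fn in steps:
        if name in done:
            continue
        fn()  # may raise; steps recorded so far stay in the checkpoint
        done.append(name)
        # Write to a temp file, then rename, so a crash mid-write cannot
        # leave a half-written checkpoint behind.
        tmp = checkpoint.with_name(checkpoint.name + ".tmp")
        tmp.write_text(json.dumps(done))
        tmp.replace(checkpoint)
        ran.append(name)
    return ran
```

On a re-run after a transient failure, only the steps that are missing from the progress file execute, each with a freshly built Spark session.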