Databricks Community

JTBS · 3 weeks ago

I am running a PySpark application in an AKS Python container/pod:

Using the Databricks Connect 18.2.1 library with a Databricks Spark cluster on 18.2

Once in a while I get the error below:

InactiveRpcError of RPC that terminated with: status = StatusCode.UNIMPLEMENTED details = "Received http2 header with status: 404" debug_error_string = "UNIMPLEMENTED:Received http2 header with status: 404

I don't see any concerning cluster health issues or events, other than a few scale-up/scale-down events. I am not sure whether these events or intermittent network issues are causing open Spark sessions to lose connectivity.

But I thought Databricks Connect 18.2.1 handled these reconnect issues better.

I am not sure exactly what triggers it, but I am positive the library is not handling some scenarios. When I run all the code within the cluster in a notebook, I don't remember ever seeing this issue. So I suspect network or scale-out events, combined with library 18.2.1 not working as expected.

I would appreciate hearing from anyone who has faced the same issue, or any insight or workarounds to get past it.

Please NOTE: This happens only once in a while, not always. Re-running the Spark application from AKS succeeds without errors most of the time.

balajij8 · 3 weeks ago

It's a remote connection state management issue that occurs when the cluster scales. StatusCode.UNIMPLEMENTED with an HTTP/2 404 indicates that the Databricks Connect client is trying to reach a target, such as a specific worker node, that no longer exists after a cluster scale-down event.

Cluster autoscaling removes worker nodes during scale-down events.
The client may cache stale node references in its connection pool.
While the new runtime has improved reconnection logic, it may not fully handle operations that are in flight during rapid scale events.

You can follow the steps below to reduce the issue:

Hard Retry & Timeout Settings - Add the Spark configurations below to the cluster so that calls fail fast and retry. You can reduce the values further after validation.

spark.databricks.io.cache.maxRetries 5
spark.databricks.io.cache.timeout 60s
spark.rpc.askTimeout 300s
spark.rpc.lookupTimeout 300s

Connection Pool Behavior - Set the Databricks Connect client configuration below in the AKS application.

# RPC timeouts
spark.conf.set("spark.rpc.retry.wait", "5s")
spark.conf.set("spark.rpc.numRetries", "5")

Application-Level Retry Logic - Wrap the Spark operations in retry logic so the Spark code can handle transient failures.
Cluster Configurations - Reduce autoscaling disruption by setting longer scale-down windows, which lowers autoscaling frequency. You can also use cluster pools to keep instances warm and reduce scale-up/scale-down frequency.

spark.databricks.clusterUsageTags.autoTerminationMinutes 30

Disable Autoscaling - If your workload is predictable, you can use a fixed-size cluster to eliminate scale-related disruptions.
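As a rough illustration of the fixed-size option: in a Clusters API style cluster specification, an autoscaling cluster carries an `autoscale` block with `min_workers` and `max_workers`, while a fixed-size cluster carries `num_workers` instead. The sketch below only rewrites a spec dictionary; the helper name `to_fixed_size` and the cluster name are hypothetical, and applying the spec to a workspace is left out.

```python
from copy import deepcopy


def to_fixed_size(cluster_spec: dict, num_workers: int) -> dict:
    """Return a copy of the spec with autoscaling replaced by a fixed worker count."""
    if num_workers < 0:
        raise ValueError("num_workers must be non-negative")
    fixed = deepcopy(cluster_spec)
    fixed.pop("autoscale", None)        # drop the autoscaling range if present
    fixed["num_workers"] = num_workers  # fixed-size clusters use num_workers
    return fixed


# Hypothetical spec for illustration only.
autoscaling_spec = {
    "cluster_name": "aks-connect-cluster",
    "autoscale": {"min_workers": 2, "max_workers": 8},
}
fixed_spec = to_fixed_size(autoscaling_spec, num_workers=4)
```

The original dictionary is left untouched, so the autoscaling spec can be kept around for comparison or rollback.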

Alternatives

Databricks Lakeflow Jobs - For scheduled/batch workloads, you can trigger Databricks Jobs directly from AKS instead of using Databricks Connect. This eliminates long-lived connection issues entirely, because Jobs run natively on the cluster with full resilience.
Serverless - If the workload is mostly SQL, you can use the Databricks SQL Connector instead of Databricks Connect. SQL Warehouses have better connection management. You can use serverless jobs too.
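To make the Jobs alternative concrete, here is a minimal sketch of triggering an existing job from the AKS pod through the Jobs REST API `run-now` endpoint, using only the Python standard library. The host, token, and job ID are placeholders you would supply yourself, and the helper names are hypothetical; error handling and polling for run completion are omitted.

```python
import json
import urllib.request


def build_run_now_request(host: str, token: str, job_id: int) -> urllib.request.Request:
    """Build (but do not send) a Jobs API run-now request."""
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/run-now",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def trigger_job(host: str, token: str, job_id: int, timeout: float = 30.0) -> int:
    """Send the request and return the run_id reported by the workspace."""
    request = build_run_now_request(host, token, job_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["run_id"]
```

Because this is a single short HTTP call rather than a long-lived gRPC session, a cluster scale event in the middle of the job does not affect the AKS side.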

iyashk-DB · 3 weeks ago

Short answer: this looks more like an intermittent Spark Connect transport/routing issue than a Spark job logic issue. Databricks Connect uses gRPC over HTTP/2, and the specific InactiveRpcError ... UNIMPLEMENTED ... Received http2 header with status: 404 pattern is consistent with an intermediary returning a non-gRPC HTTP 404 instead of a Spark Connect response.
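For context on why a 404 surfaces as UNIMPLEMENTED: when a response carries no gRPC status of its own, gRPC clients derive one from the HTTP status code, following the gRPC project's HTTP-to-gRPC status code mapping. A minimal sketch of that mapping in Python is below; the function name is illustrative, and the real logic lives inside the gRPC library rather than in your application.

```python
# HTTP status -> gRPC status name, used when the response has no grpc-status.
HTTP_TO_GRPC_STATUS = {
    400: "INTERNAL",
    401: "UNAUTHENTICATED",
    403: "PERMISSION_DENIED",
    404: "UNIMPLEMENTED",
    429: "UNAVAILABLE",
    502: "UNAVAILABLE",
    503: "UNAVAILABLE",
    504: "UNAVAILABLE",
}


def grpc_status_for_http(http_status: int) -> str:
    """Return the gRPC status name a client reports for a plain HTTP response."""
    return HTTP_TO_GRPC_STATUS.get(http_status, "UNKNOWN")
```

So UNIMPLEMENTED here does not mean a Spark Connect method is missing; it means something on the path answered with an ordinary HTTP 404.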

A few things stand out:

Public release notes do not say that 18.2.1 specifically added the 404/reconnect handling you’re expecting; for Python, 18.2.1 is only described as “minor fixes and internal improvements.”
The explicit retry improvement for transient non-gRPC responses like HTTP 404 is called out in the 18.1.3 line: the client “automatically retries transient errors that occur when an intermediary proxy returns a non-gRPC response (for example, HTTP 404…).”
There is already a newer 18.2.2 client, and Databricks recommends using the latest version; the runtime version must be greater than or equal to the Connect version.

So I would not conclude “library bug only,” but I also would not dismiss your network / scale-event theory. Similar internal examples show Spark Connect failures where the router endpoint became temporarily unavailable or the upstream returned an invalid 503, which is very much in the same family of transient transport failures rather than Spark execution failures.

What I’d do

Upgrade the client first to databricks-connect 18.2.2 (or newer) and keep the cluster runtime at a compatible version.
Add application-level retry with session recreation around idempotent Spark actions. When Spark Connect sessions expire or the transport drops, the guidance is to create a new session via DatabricksSession.builder.getOrCreate() for Databricks Connect clients.
Treat this as a transient-connectivity class error in AKS: catch _InactiveRpcError / UNAVAILABLE / HTTP-404-on-gRPC-path, rebuild the session, and retry the work unit if it is safe to do so.
Turn on Databricks Connect Python logging so you can correlate exact failure timestamps with cluster scale events or network events.
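For the logging step, a small sketch of what the AKS application could do: emit its own log lines with UTC timestamps so failures line up with the cluster event log, and request debug output from the Spark Connect client. As far as I know, the `SPARK_CONNECT_LOG_LEVEL` environment variable is the documented switch for Databricks Connect for Python; treating it as something to set before the session is created is my assumption, and the logger name `aks_spark_job` is hypothetical.

```python
import logging
import os
import time

# Ask the Spark Connect client for debug logs (assumed to be read at session creation).
os.environ.setdefault("SPARK_CONNECT_LOG_LEVEL", "debug")


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Return an application logger that writes UTC timestamps to stderr."""
    formatter = logging.Formatter(
        fmt="%(asctime)sZ %(levelname)s %(name)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S",
    )
    formatter.converter = time.gmtime  # render timestamps in UTC, not local time
    handler = logging.StreamHandler()
    handler.setFormatter(formatter)
    logger = logging.getLogger("aks_spark_job")  # hypothetical logger name
    logger.setLevel(level)
    logger.handlers = [handler]  # avoid duplicate handlers on repeated calls
    logger.propagate = False
    return logger
```

Logging the start and end of each Spark action with this logger gives you timestamps to compare against scale-up/scale-down events.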

The safest workaround is to structure the AKS job so each major step can be retried after:

rebuilding the Spark session, and
resuming from a checkpoint / last completed stage.
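The retry-with-session-rebuild idea can be sketched roughly as follows. This is an illustration under assumptions, not the library's own mechanism: the helper names, the marker strings, and the backoff values are mine. It matches on the error text so the sketch does not need to import grpc, it obtains a session through `DatabricksSession.builder.getOrCreate()` on every attempt, and it should only wrap work that is safe to repeat.

```python
import time
from typing import Any, Callable

# Error text treated as transient: the 404 from this thread plus gRPC UNAVAILABLE.
# Adjust to what you actually observe in your logs.
TRANSIENT_MARKERS = (
    "Received http2 header with status: 404",
    "StatusCode.UNAVAILABLE",
)


def is_transient(exc: BaseException) -> bool:
    """Heuristic check on the exception text (an assumption, not an official API)."""
    text = f"{type(exc).__name__}: {exc}"
    return any(marker in text for marker in TRANSIENT_MARKERS)


def new_databricks_session() -> Any:
    """Create a Databricks Connect session; imported lazily so the module loads anywhere."""
    from databricks.connect import DatabricksSession
    return DatabricksSession.builder.getOrCreate()


def run_with_retry(
    work: Callable[[Any], Any],
    make_session: Callable[[], Any] = new_databricks_session,
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Run work(session), requesting a session and backing off after transient errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            session = make_session()  # a session is requested on every attempt
            return work(session)
        except Exception as exc:
            if not is_transient(exc) or attempt == attempts:
                raise  # non-transient error, or retries exhausted
            sleep(base_delay * 2 ** (attempt - 1))  # 2s, 4s, 8s, ...
```

A call might look like `run_with_retry(lambda spark: spark.table("my_table").count())`, where `my_table` is a placeholder. Keeping each work unit small and idempotent is what makes resuming from the last completed step practical.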

StatusCode.UNIMPLEMENTED error: Databricks Connect library calling a Spark cluster from AKS/PySpark
