Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

StatusCode.UNIMPLEMENTED error: Databricks Connect library calling a Spark cluster from AKS/PySpark

JTBS
New Contributor II

I am running a PySpark application in an AKS Python container/pod:

Using the Databricks Connect 18.2.1 library with a Databricks Spark cluster 18.2

Once in a while I get the error below:

InactiveRpcError of RPC that terminated with: status = StatusCode.UNIMPLEMENTED details = "Received http2 header with status: 404" debug_error_string = "UNIMPLEMENTED:Received http2 header with status: 404

I don't see any concerning cluster health issues or events, other than a few scale-up/scale-down events. I am not sure whether these events or intermittent network issues are causing open Spark sessions to lose connectivity.

But I thought Databricks Connect 18.2.1 handled these reconnect issues better.

I am not sure exactly what triggers it, but I am fairly confident the library is not handling some scenarios. When I run all the code within the cluster in a notebook, I don't remember ever seeing this issue. So I suspect network issues or scale-out events, combined with library 18.2.1 not working as expected.

I would appreciate hearing from anyone who has faced the same issue or can share insight or workarounds.

Please NOTE: This happens only once in a while, not always. Re-running the Spark application from AKS completes without errors most of the time.

 

1 REPLY

balajij8
Contributor III

This is a remote connection state management issue that occurs when the cluster scales. StatusCode.UNIMPLEMENTED with an HTTP/2 404 indicates that the Databricks Connect client is trying to reach a target, such as a specific worker node, that no longer exists after a cluster scale-down event.

  • Cluster autoscaling removes worker nodes during scale-down events
  • The Databricks Connect client may cache stale node references in its connection pool
  • While the new runtime has improved reconnection logic, it may not fully handle operations that are in flight during rapid scale events

The steps below can reduce the issue:

  • Hard Retry & Timeout Settings - Add the Spark configurations below to the cluster to fail fast and retry. You can reduce the values further after validation.

spark.databricks.io.cache.maxRetries 5
spark.databricks.io.cache.timeout 60s
spark.rpc.askTimeout 300s
spark.rpc.lookupTimeout 300s
  • Connection Pool Behavior - Set the Databricks Connect client configuration below in the AKS application.

# RPC retry settings
spark.conf.set("spark.rpc.retry.wait", "5s")
spark.conf.set("spark.rpc.numRetries", "5")
  • Application-Level Retry Logic - Wrap the Spark operations in retry logic in the Spark code to handle transient failures.

  • Cluster Configurations - Reduce autoscaling disruption by setting longer scale-down windows so that autoscaling happens less often. You can use cluster pools to keep instances warm and reduce scale-up/scale-down frequency.
spark.databricks.clusterUsageTags.autoTerminationMinutes 30
  • Disable Autoscaling - If your workload is predictable, you can use a fixed-size cluster to eliminate scaling-related disruptions.
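
The application-level retry suggested above can be sketched roughly as follows. This is a minimal illustration, not a Databricks-provided helper: the names `run_with_retry` and `is_transient_connect_error` are made up for this example, and matching on the error text assumes the exception message contains the same strings as the error quoted in the question. Retrying is only safe for operations that can be repeated without side effects (reads, idempotent writes).

```python
import random
import time


def is_transient_connect_error(exc: BaseException) -> bool:
    """Heuristic (assumption): treat the UNIMPLEMENTED / HTTP 404 error
    quoted in this thread as a transient, retryable failure."""
    text = str(exc)
    return (
        "StatusCode.UNIMPLEMENTED" in text
        or "Received http2 header with status: 404" in text
    )


def run_with_retry(operation, max_attempts=5, base_delay=2.0, max_delay=60.0,
                   is_transient=is_transient_connect_error, sleep=time.sleep):
    """Call operation() and retry transient failures with exponential backoff.

    Non-transient errors, and the last failed attempt, are re-raised unchanged.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_attempts or not is_transient(exc):
                raise
            # Exponential backoff capped at max_delay, plus random jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))


# Example usage (hypothetical table name; `spark` is your existing session):
# row_count = run_with_retry(lambda: spark.table("my_table").count())
```

If the session itself is stale, the callable passed in may also need to obtain a fresh session before retrying; whether that is necessary depends on how the client recovers, which this sketch does not try to determine.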

Alternatives

  • Databricks Lakeflow Jobs - For scheduled or batch workloads, you can trigger Databricks Jobs directly from AKS instead of using Databricks Connect. This avoids long-lived connection issues because the jobs run natively on the cluster.
  • Serverless - If the workload is mostly SQL, you can use the Databricks SQL Connector instead of Databricks Connect. SQL Warehouses have better connection management. You can use serverless jobs too.
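
As a rough sketch of the Jobs alternative, the AKS application could trigger an existing job through the Jobs REST API (`POST /api/2.1/jobs/run-now`) using only the Python standard library. The function names, the host URL, the token, and the job ID below are placeholders for illustration; authentication via a bearer token is assumed, and your workspace may require a different method.

```python
import json
import urllib.request


def build_run_now_request(host: str, token: str, job_id: int) -> urllib.request.Request:
    """Build (but do not send) a run-now request for an existing job."""
    url = host.rstrip("/") + "/api/2.1/jobs/run-now"
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def trigger_job(host: str, token: str, job_id: int, timeout: float = 30.0) -> int:
    """Send the request and return the run_id of the triggered run."""
    request = build_run_now_request(host, token, job_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["run_id"]


# Example usage (placeholder workspace URL, token, and job ID):
# run_id = trigger_job("https://<workspace-host>", "<token>", 123)
```

The application then only needs a short-lived HTTPS call per run rather than a long-lived Spark session, which is the point of this alternative.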