Re: StatusCode.UNIMPLEMENTED error: DatabricksConn...

balajij8 · a month ago

This is a remote connection state management issue that occurs when the cluster scales. StatusCode.UNIMPLEMENTED with an HTTP/2 404 indicates that the Databricks Connect client is trying to reach a target, such as a specific worker node, that no longer exists after a cluster scale-down event.

Cluster autoscaling removes worker nodes during scale-down events.
The client may cache stale node references in its connection pool.
While newer runtimes have improved reconnection logic, they may not fully handle in-flight operations during rapid scale events.

You can take the following steps to reduce these issues:

Hard Retry & Timeout Settings - Add the Spark configurations below to the cluster so that it fails fast and retries. You can reduce the values further after validation.

spark.databricks.io.cache.maxRetries 5
spark.databricks.io.cache.timeout 60s
spark.rpc.askTimeout 300s
spark.rpc.lookupTimeout 300s

Connection Pool Behavior - Set the Databricks Connect client configuration below in the AKS application:

# RPC retry settings
spark.conf.set("spark.rpc.retry.wait", "5s")
spark.conf.set("spark.rpc.numRetries", "5")

Application-Level Retry Logic - Wrap Spark operations in retry logic in the application code to handle transient failures.
Cluster Configurations - Reduce autoscaling disruption by setting longer scale-down windows, which lowers autoscaling frequency. You can also use cluster pools to keep instances warm and reduce scale-up/down frequency.

spark.databricks.clusterUsageTags.autoTerminationMinutes 30

Disable Autoscaling - If your workload is predictable, you can use a fixed-size cluster to eliminate scale-related disruptions.
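
For the application-level retry logic mentioned above, here is a minimal sketch in Python. It is an illustration under assumptions, not part of Databricks Connect: the helper names (with_retry, is_transient), the marker strings used to detect a transient failure, and the backoff values are all hypothetical and should be adapted to the errors you actually see.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: these substrings in the error message mark a transient
# connection failure like the one in this thread. Adjust to your own errors.
TRANSIENT_MARKERS = ("UNIMPLEMENTED", "UNAVAILABLE")


def is_transient(exc: Exception) -> bool:
    """Return True if the error message looks like a transient failure."""
    message = str(exc)
    return any(marker in message for marker in TRANSIENT_MARKERS)


def with_retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Re-raise immediately on the last attempt or on errors
            # that do not look transient.
            if attempt == max_attempts or not is_transient(exc):
                raise
            # Exponential backoff: base_delay, 2x, 4x, ... plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")
```

Assuming spark is your existing Databricks Connect session, a call could look like with_retry(lambda: spark.sql("SELECT 1").collect()). If the session itself has gone stale, the wrapped operation may also need to recreate it before retrying. Only wrap operations that are safe to run more than once.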

Alternatives

Databricks Lakeflow Jobs - For scheduled or batch workloads, you can trigger Databricks Jobs directly from AKS instead of using Databricks Connect. This eliminates long-lived connection issues because Jobs run natively on the cluster with full resilience.
Serverless - If the workload is mostly SQL, you can use the Databricks SQL Connector instead of Databricks Connect. SQL Warehouses have better connection management. You can also use serverless jobs.
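
As a rough illustration of the Jobs alternative, the sketch below triggers an existing job from the AKS application through the Jobs REST API run-now endpoint, using only the Python standard library. The function names, the workspace host, the token handling, and the job ID are placeholders I am assuming for the example. In practice you may prefer the Databricks SDK or CLI.

```python
import json
import urllib.request


def build_run_now_request(host: str, token: str, job_id: int) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Jobs API run-now endpoint."""
    url = f"{host.rstrip('/')}/api/2.1/jobs/run-now"
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def trigger_job(host: str, token: str, job_id: int, timeout: float = 30.0) -> int:
    """Send the run-now request and return the run_id from the response."""
    request = build_run_now_request(host, token, job_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["run_id"]
```

Because each call is a short HTTP request, the application holds no long-lived session that a scale-down event could invalidate. The job run itself is managed by Databricks.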
