balajij8
Contributor III

This is a remote connection state management issue that occurs when the cluster scales. StatusCode.UNIMPLEMENTED with HTTP2 404 indicates that the Databricks Connect client is trying to reach a target, such as a specific worker node, that no longer exists after a cluster scale-down event.

  • Cluster autoscaling removes worker nodes during scale-down events
  • The Databricks Connect client may cache stale node references in its connection pool
  • While the newer runtime has improved reconnection logic, it may not fully handle in-flight operations during rapid scale events

You can take the following steps to reduce these issues:

  • Hard Retry & Timeout Settings - Add the Spark configurations below to the cluster to fail fast and retry. You can reduce the values further after validation.

spark.databricks.io.cache.maxRetries 5
spark.databricks.io.cache.timeout 60s
spark.rpc.askTimeout 300s
spark.rpc.lookupTimeout 300s
  • Connection Pool Behavior - Set the Databricks Connect client configuration below in the AKS application.

# RPC retry settings
spark.conf.set("spark.rpc.retry.wait", "5s")
spark.conf.set("spark.rpc.numRetries", "5")
  • Application-Level Retry Logic - Wrap the Spark operations in retry logic so that the Spark code can handle transient failures.

  • Cluster Configurations - To reduce autoscaling disruption, lower the autoscaling frequency by setting longer scale-down windows. You can also use cluster pools to keep instances warm and reduce how often the cluster scales up or down.

spark.databricks.clusterUsageTags.autoTerminationMinutes 30
  • Disable Autoscaling - If your workload is predictable, you can use a fixed-size cluster to eliminate scaling-related disruptions.
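
For the application-level retry logic mentioned above, here is a minimal sketch in Python. The names `run_with_retry` and `is_transient` are hypothetical helpers, not part of Databricks Connect, and matching on the error message text is only an assumed heuristic; adjust it to the exceptions you actually see.

```python
import random
import time


def is_transient(exc: Exception) -> bool:
    """Assumed heuristic: retry only if the message looks like a connection error."""
    message = str(exc)
    return any(marker in message for marker in ("UNIMPLEMENTED", "UNAVAILABLE", "404"))


def run_with_retry(operation, max_attempts=5, base_delay=2.0, max_delay=60.0,
                   sleep=time.sleep):
    """Call operation(); on transient errors, retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Give up on the last attempt or on errors that do not look transient.
            if attempt == max_attempts or not is_transient(exc):
                raise
            # Delay doubles each attempt, is capped, and gets a little jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 10))
```

You would call it as `run_with_retry(lambda: spark.table("my_table").count())`, where `my_table` is a placeholder. Depending on the failure, the session itself may need to be recreated before retrying; that is not shown here.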

Alternatives

  • Databricks Lakeflow Jobs - For scheduled or batch workloads, you can trigger Databricks Jobs directly from AKS instead of using Databricks Connect. This eliminates the long-lived connection issues, as Jobs run natively on the cluster.
  • Serverless - If the workload is mostly SQL, you can use the Databricks SQL Connector instead of Databricks Connect. SQL Warehouses have better connection management. You can also use serverless jobs.
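
As a rough illustration of the Jobs alternative, the sketch below triggers an existing job from the AKS application through the Jobs REST API `run-now` endpoint, using only the Python standard library. The function names are hypothetical, and the workspace URL, token, and job ID are placeholders you would supply yourself; it assumes the job has already been created in the workspace.

```python
import json
import urllib.request


def build_run_now_request(host: str, token: str, job_id: int) -> urllib.request.Request:
    """Build a POST request for the Jobs API run-now endpoint (nothing is sent here)."""
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/run-now",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def trigger_job(host: str, token: str, job_id: int, timeout: float = 30.0) -> int:
    """Send the request and return the run_id reported by the workspace."""
    request = build_run_now_request(host, token, job_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["run_id"]
```

Because each call is a short HTTP request, the AKS application holds no long-lived session to the cluster, which is the point of this alternative.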