yesterday
This is a remote connection state management issue that occurs when the cluster scales. StatusCode.UNIMPLEMENTED with HTTP2 404 indicates that the Databricks Connect client is trying to reach a target, such as a specific worker node, that no longer exists after a cluster scale-down event.
- Cluster autoscaling removes worker nodes during scale-down events
- The Databricks Connect client may cache stale node references in its connection pool
- While the new runtime has improved reconnection logic, it may not fully handle in-flight operations during rapid scale events
You can take the following steps to reduce the issue:
Hard Retry & Timeout Settings - Add the Spark configurations below to the cluster to fail fast and retry. You can reduce the values further after validation.
spark.databricks.io.cache.maxRetries 5
spark.databricks.io.cache.timeout 60s
spark.rpc.askTimeout 300s
spark.rpc.lookupTimeout 300s
Connection Pool Behavior - Set the Databricks Connect client configuration given below in the AKS application
# RPC timeouts
spark.conf.set("spark.rpc.retry.wait", "5s")
spark.conf.set("spark.rpc.numRetries", "5")
Application-Level Retry Logic - Wrap the Spark operations in your Spark code with retry logic to handle transient failures
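A minimal sketch of such a retry wrapper, assuming the transient failure surfaces as an exception whose message contains the gRPC status name. The helper names `is_transient` and `with_retry`, the list of status names, and the backoff values are illustrative assumptions, not part of Databricks Connect.

```python
import time


def is_transient(exc: Exception) -> bool:
    """Heuristic (assumption): errors mentioning these gRPC status names are retryable."""
    message = str(exc)
    return any(code in message for code in ("UNIMPLEMENTED", "UNAVAILABLE"))


def with_retry(operation, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Run operation(); on a transient error, back off exponentially and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Give up on the last attempt or on errors that do not look transient.
            if attempt == max_attempts or not is_transient(exc):
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

A call could then look like `with_retry(lambda: spark.table("my_table").count())`, where `my_table` is a placeholder. Only actions that are safe to repeat should be wrapped this way, since a retried write may otherwise run twice.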
- Cluster Configurations - Reduce autoscaling disruption by setting longer scale-down windows so that autoscaling happens less often. You can also use cluster pools to keep instances warm and reduce scale-up/down frequency.
spark.databricks.clusterUsageTags.autoTerminationMinutes 30
- Disable Autoscaling - You can use a fixed-size cluster if your workload is predictable to eliminate scale-related disruptions.
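To illustrate the fixed-size option, here is a small sketch that turns an autoscaling cluster spec into a fixed-size one. It assumes the spec follows the Clusters API shape, where `autoscale` (with `min_workers` and `max_workers`) and `num_workers` are alternatives. The helper `to_fixed_size`, the cluster name, and the worker counts are hypothetical.

```python
def to_fixed_size(cluster_spec: dict, num_workers: int) -> dict:
    """Return a copy of cluster_spec with a fixed worker count instead of autoscaling."""
    fixed = {key: value for key, value in cluster_spec.items() if key != "autoscale"}
    fixed["num_workers"] = num_workers
    return fixed


# Hypothetical autoscaling spec; the name and sizes are placeholders.
autoscaling_spec = {
    "cluster_name": "connect-cluster",
    "autoscale": {"min_workers": 2, "max_workers": 8},
}

fixed_spec = to_fixed_size(autoscaling_spec, num_workers=4)
print(fixed_spec)
```

The resulting spec would still need the remaining required cluster fields before being submitted through the cluster UI, API, or your infrastructure tooling.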
Alternatives
- Databricks Lakeflow Jobs - For scheduled or batch workloads, you can trigger Databricks Jobs directly from AKS instead of using Databricks Connect. This eliminates long-lived connection issues entirely, as Jobs run natively on the cluster with full resilience.
- Serverless - You can use the Databricks SQL Connector instead of Databricks Connect if the workload is mostly SQL. SQL Warehouses have better connection management. You can use serverless jobs too.
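For the Jobs alternative, triggering an existing job from the AKS application can be sketched as a single call to the Jobs API `run-now` endpoint. This is a sketch under stated assumptions: the workspace URL, token, and job ID are placeholders you would supply, the helper names `build_run_now_request` and `trigger_job` are hypothetical, and API version 2.1 is assumed.

```python
import json
import urllib.request


def build_run_now_request(host: str, token: str, job_id: int) -> urllib.request.Request:
    """Build the POST request that asks the Jobs API to start a run of job_id."""
    url = f"{host.rstrip('/')}/api/2.1/jobs/run-now"
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def trigger_job(host: str, token: str, job_id: int, timeout: float = 30.0) -> int:
    """Send the request and return the run_id reported by the workspace."""
    request = build_run_now_request(host, token, job_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["run_id"]
```

Because each call is a short-lived HTTP request, the application holds no connection to individual cluster nodes between calls.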