I am running a PySpark application in a Python container/pod on AKS,
using the Databricks Connect 18.2.1 library with a Databricks Spark cluster on 18.2.
Once in a while I get the error below:
InactiveRpcError of RPC that terminated with: status = StatusCode.UNIMPLEMENTED details = "Received http2 header with status: 404" debug_error_string = "UNIMPLEMENTED:Received http2 header with status: 404
I don't see any concerning cluster health issues or events, other than a few scale-up/scale-down events. I am not sure whether these events or intermittent network issues are causing open Spark sessions to lose connectivity.
But I thought Databricks Connect 18.2.1 handled these reconnect issues better.
I am not sure exactly what triggers it, but I am positive the library is failing to handle some scenarios. If I run all the code within the cluster in a notebook, I don't remember ever seeing this issue. So I suspect network issues or scale-out events, combined with library 18.2.1 not working as expected.
I would appreciate hearing from anyone who has faced the same issue or can share some insight or workarounds.
Please NOTE: This happens only once in a while, not always. Re-running the Spark application from AKS completes without errors most of the time.
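
Since a plain re-run usually succeeds, the re-run could be automated as a stopgap. Below is a minimal sketch, not a fix for the root cause. The names `run_with_retry`, `is_transient_rpc_error`, `job`, and `make_session` are hypothetical, and matching on the error message text is an assumption based only on the error shown above. `make_session` stands for whatever the application already uses to build its Spark session.

```python
import time


def is_transient_rpc_error(exc):
    # Assumption: identify the intermittent failure by the message text
    # quoted in the error above, rather than by a specific exception class.
    msg = str(exc)
    return "StatusCode.UNIMPLEMENTED" in msg and "status: 404" in msg


def run_with_retry(job, make_session, max_attempts=3, base_delay=1.0,
                   sleep=time.sleep):
    """Run job(session), retrying with a new session on the transient error.

    job and make_session are hypothetical callables supplied by the
    application; sleep is injectable so the backoff can be tested.
    """
    for attempt in range(1, max_attempts + 1):
        # Build the session inside the loop so a retry does not reuse
        # a session that may have lost its connection.
        session = make_session()
        try:
            return job(session)
        except Exception as exc:
            # Give up on the last attempt or on any unrelated error.
            if attempt == max_attempts or not is_transient_rpc_error(exc):
                raise
            # Exponential backoff: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

This only makes sense if the job is safe to run twice (for example, it overwrites its output rather than appending), because a failure partway through would otherwise leave partial results behind.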