yesterday
Short answer: this looks more like an intermittent Spark Connect transport/routing issue than a Spark job logic issue. Databricks Connect uses gRPC over HTTP/2, and the specific `InactiveRpcError ... UNIMPLEMENTED ... Received http2 header with status: 404` pattern is consistent with an intermediary returning a non-gRPC HTTP 404 instead of a Spark Connect response.
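To make the `UNIMPLEMENTED` plus 404 pairing less mysterious: when a gRPC client gets a plain HTTP response instead of a gRPC one, it derives a status code from the HTTP status. The table below is my transcription of the standard gRPC HTTP-to-status mapping; the function name is just illustrative.

```python
# HTTP status -> gRPC status name, applied by gRPC clients when the
# response is not a valid gRPC response (for example, from a proxy).
HTTP_TO_GRPC_STATUS = {
    400: "INTERNAL",
    401: "UNAUTHENTICATED",
    403: "PERMISSION_DENIED",
    404: "UNIMPLEMENTED",
    429: "UNAVAILABLE",
    502: "UNAVAILABLE",
    503: "UNAVAILABLE",
    504: "UNAVAILABLE",
}


def grpc_status_for_http(http_status: int) -> str:
    """Return the gRPC status name a client would report for an HTTP status."""
    # Anything not listed in the mapping is reported as UNKNOWN.
    return HTTP_TO_GRPC_STATUS.get(http_status, "UNKNOWN")
```

So an `UNIMPLEMENTED` here most likely means "something on the path answered 404", not "the server does not support this RPC".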
A few things stand out:
- Public release notes do not say that `18.2.1` specifically added the 404/reconnect handling you’re expecting; for Python, `18.2.1` is only described as “minor fixes and internal improvements.”
- The explicit retry improvement for transient non-gRPC responses like HTTP 404 is called out in the `18.1.3` line: the client “automatically retries transient errors that occur when an intermediary proxy returns a non-gRPC response (for example, HTTP 404…).”
- There is already a newer `18.2.2` client, and Databricks recommends using the latest version; the runtime version must be greater than or equal to the Connect version.
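If you want to guard against a version mismatch in CI, a rough check could look like the sketch below. It assumes that comparing only the major and minor components is what matters, which is my reading of the compatibility rule and not something I have confirmed; both function names are made up for this example.

```python
def version_key(version: str, parts: int = 2) -> tuple:
    """Turn '18.2.2' or '18.2.x-scala2.13' into a comparable (18, 2) tuple."""
    return tuple(int(p) for p in version.split(".")[:parts])


def runtime_supports_client(runtime_version: str, client_version: str) -> bool:
    """True if the runtime's major.minor is >= the client's major.minor."""
    # Assumption: patch levels are ignored for compatibility purposes.
    return version_key(runtime_version) >= version_key(client_version)
```

For example, `runtime_supports_client("18.1", "18.2.2")` is `False`, which is the situation to avoid when bumping the client.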
So I would not conclude “library bug only,” but I also would not dismiss your network / scale-event theory. Similar internal examples show Spark Connect failures where the router endpoint became temporarily unavailable or the upstream returned an invalid 503, which is very much in the same family of transient transport failures rather than Spark execution failures.
What I’d do
- Upgrade the client first to `databricks-connect 18.2.2` (or newer) and keep the cluster runtime at a compatible version.
- Add application-level retry with session recreation around idempotent Spark actions. When Spark Connect sessions expire or the transport drops, the guidance is to create a new session via `DatabricksSession.builder.getOrCreate()` for Databricks Connect clients.
- Treat this as a transient-connectivity class error in AKS: catch `_InactiveRpcError` / `UNAVAILABLE` / HTTP-404-on-gRPC-path, rebuild the session, and retry the work unit if it is safe to do so.
- Turn on Databricks Connect Python logging so you can correlate exact failure timestamps with cluster scale events or network events.
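A minimal sketch of the retry-with-session-recreation idea, not an official pattern. Assumptions on my side: the error can be recognized from its gRPC status name or from the `status: 404` text in its message; calling `stop()` on the old session lets `getOrCreate()` hand back a fresh one; and `is_transient`, `run_with_session_retry`, and `make_databricks_session` are names I made up. Only wrap work that is idempotent.

```python
import logging
import time

log = logging.getLogger("connect_retry")


def is_transient(exc: BaseException) -> bool:
    """Heuristic: UNAVAILABLE status, or the HTTP 404 text in the message."""
    text = str(exc)
    name = ""
    code = getattr(exc, "code", None)  # grpc errors usually expose code()
    if callable(code):
        try:
            name = getattr(code(), "name", "")
        except Exception:
            name = ""
    if name == "UNAVAILABLE" or "UNAVAILABLE" in text:
        return True
    # A bare UNIMPLEMENTED is not retried; only the proxy-404 variant is.
    return "status: 404" in text


def run_with_session_retry(make_session, work, attempts=3,
                           base_delay=1.0, sleep=time.sleep):
    """Run work(session); on a transient error, rebuild the session and retry."""
    for attempt in range(1, attempts + 1):
        session = make_session()
        try:
            return work(session)
        except Exception as exc:
            if attempt == attempts or not is_transient(exc):
                raise
            # The log record's timestamp is what you correlate with scale events.
            log.warning("transient Spark Connect failure (attempt %d/%d): %s",
                        attempt, attempts, exc)
            stop = getattr(session, "stop", None)
            if callable(stop):
                try:
                    stop()  # best effort; the transport may already be gone
                except Exception:
                    pass
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


def make_databricks_session():
    """Session factory for Databricks Connect (imported lazily)."""
    from databricks.connect import DatabricksSession
    return DatabricksSession.builder.getOrCreate()
```

Usage would be something like `run_with_session_retry(make_databricks_session, lambda spark: spark.table("my_table").count())`, where `my_table` is a placeholder.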
The safest workaround is to structure the AKS job so each major step can be retried after:
- rebuilding the Spark session, and
- resuming from a checkpoint / last completed stage.
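One way to sketch the checkpoint-and-resume structure is below. It assumes each step is idempotent and that the checkpoint file lives on storage that survives a pod restart (for example, a persistent volume); the helper names and the JSON file format are my own choices, and `run_step` is a hypothetical hook where a retry-with-new-session wrapper would go.

```python
import json
import os
import tempfile


def load_done(path):
    """Return the list of completed step names, or [] if no checkpoint exists."""
    if not os.path.exists(path):
        return []
    with open(path) as fh:
        return json.load(fh)


def mark_done(path, done):
    """Persist completed step names via a temp file plus atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(done, fh)
    os.replace(tmp, path)  # a crash never leaves a half-written checkpoint


def run_steps(steps, checkpoint_path, run_step=lambda fn: fn()):
    """Run (name, fn) steps in order, skipping those already checkpointed."""
    done = load_done(checkpoint_path)
    for name, fn in steps:
        if name in done:
            continue  # finished in an earlier run
        run_step(fn)  # an exception here leaves the checkpoint untouched
        done.append(name)
        mark_done(checkpoint_path, done)
    return done
```

If the job dies in the middle of a step, rerunning `run_steps` with the same checkpoint path picks up at that step instead of starting over.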