Re: Apache "Spark Connect"

Louis_Frolio · ‎03-20-2026

Great questions — let me take them one at a time.

On the jobs cluster + Spark Connect timing scenario: yes, the cluster will shut down on you. A jobs cluster is lifecycle-bound to the job run that launched it — the scheduler creates it, runs the job, and terminates it when the job is done. There is no mechanism that detects an external Spark Connect session mid-execution and waits for it to finish. Your remote client gets dropped, and any in-flight plan execution dies with it.

Worth flagging: Databricks Connect is designed to target all-purpose clusters, not jobs clusters. If your use case involves remote clients running iterative or long-running Spark work, an all-purpose cluster is the right fit. You configure auto-termination based on inactivity, and commands running over Spark Connect count as activity, so the cluster stays up while your client is actually executing work (don't count on an open but idle session to hold it). If you want a fully managed lifecycle with no cluster provisioning overhead, serverless compute is increasingly where Databricks is pointing people for this pattern.
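To make the all-purpose cluster setup concrete, here is a minimal sketch of how a remote Python client could describe its target. The `ConnectTarget` class and `open_session` helper are hypothetical names for illustration, and the workspace, token, and cluster ID values are placeholders. The connection-string layout follows the `sc://...` form Databricks documents for `SPARK_REMOTE`; treat it as an assumption and check it against your Databricks Connect version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectTarget:
    """Hypothetical holder for the three values a remote client needs."""
    workspace: str   # workspace instance name, without the https:// scheme
    token: str       # personal access token (placeholder in this sketch)
    cluster_id: str  # ID of an all-purpose cluster, not a jobs cluster

    def host_url(self) -> str:
        # Form expected by the keyword-style builder call below.
        return f"https://{self.workspace}"

    def spark_remote(self) -> str:
        # Connection-string form, usable as the SPARK_REMOTE environment variable.
        return (
            f"sc://{self.workspace}:443/"
            f";token={self.token}"
            f";x-databricks-cluster-id={self.cluster_id}"
        )


def open_session(target: ConnectTarget):
    """Open a Databricks Connect session against the all-purpose cluster.

    Imported lazily so the rest of this sketch runs without
    databricks-connect installed.
    """
    from databricks.connect import DatabricksSession

    return (
        DatabricksSession.builder
        .remote(
            host=target.host_url(),
            token=target.token,
            cluster_id=target.cluster_id,
        )
        .getOrCreate()
    )


# Placeholder values only; nothing here contacts a workspace.
target = ConnectTarget(
    workspace="adb-1234.5.azuredatabricks.net",
    token="dapi-example",
    cluster_id="0101-123456-abcdefgh",
)
print(target.spark_remote())
```

Calling `open_session(target)` is the only step that touches the network, and it will fail against a cluster that has already terminated, which is exactly the jobs-cluster failure mode described above.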

On SparkR vs. sparklyr: these are two distinct things, and it's worth separating them. SparkR is the official Apache Spark R package, maintained by the Apache project. It is deprecated upstream as of Apache Spark 4.0, and Databricks has deprecated it in Databricks Runtime 16.0 and above. sparklyr is a separate R package from Posit (formerly RStudio). It predates the SparkR deprecation and has long been the community-preferred way to work with Spark from R, mostly because of its dplyr-style interface. It's not Databricks' official replacement for SparkR; it's just what most R practitioners were already using. sparklyr added support for Databricks Connect v2 (Spark Connect-based) through a companion package called pysparklyr.

On R UDFs in sparklyr over Spark Connect: this is where it gets interesting. Spark Connect has no native R UDF pathway over the wire; the UDF support sparklyr can build on is the Python one. sparklyr works around this using rpy2, a Python library that embeds and executes R code. The flow: the R function you pass to spark_apply() is wrapped in a Python UDF on the client side, shipped to the cluster like any other Python UDF, and executed there by rpy2. That means rpy2 needs to be installed on both ends: your local Python environment and the Databricks cluster. It works, and it supports Arrow for efficient data exchange, but it's a bridge, not a native interface. Any R packages your UDF depends on also need to be pre-installed on the cluster, which adds operational overhead.

On the gRPC/protobuf UDF angle: you read the silence correctly. Using the Spark Connect plugin API to extend the protobuf schema and carry custom UDF logic is technically possible, but that's framework-extension territory — not a user-facing interface and not something most application developers would wire up themselves. The practical path for non-Python/JVM clients that need UDF behavior: define the UDF in Python or Scala on the cluster side and call it from your Spark Connect client. Not perfect, but supported and maintainable.
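One way to read "define the UDF on the cluster side and call it from your client" is to register the function as a catalog object, so any Spark Connect client that can send SQL can invoke it. The sketch below assumes Unity Catalog Python UDFs are available on your compute; the helper `python_udf_ddl`, the function name `main.default.fahrenheit`, and the table `main.default.readings` are all hypothetical.

```python
import textwrap


def python_udf_ddl(qualified_name: str, params: str, returns: str, body: str) -> str:
    """Build a CREATE FUNCTION statement for a Python-bodied SQL function.

    Assumes the Unity Catalog `LANGUAGE PYTHON AS $$ ... $$` syntax.
    """
    body = textwrap.dedent(body).strip()
    return (
        f"CREATE OR REPLACE FUNCTION {qualified_name}({params})\n"
        f"RETURNS {returns}\n"
        "LANGUAGE PYTHON\n"
        f"AS $$\n{body}\n$$"
    )


# Hypothetical function: convert Celsius to Fahrenheit.
ddl = python_udf_ddl(
    qualified_name="main.default.fahrenheit",
    params="celsius DOUBLE",
    returns="DOUBLE",
    body="return celsius * 9.0 / 5.0 + 32.0",
)

# Plain SQL that any Spark Connect client (Go, Rust, R, ...) could send
# once the function exists; no client-side UDF support is needed.
call_sql = (
    "SELECT main.default.fahrenheit(temp_c) AS temp_f "
    "FROM main.default.readings"
)


def register_and_query(spark):
    """Run the DDL once from a Python session, then query through SQL."""
    spark.sql(ddl)
    return spark.sql(call_sql)


print(ddl)
```

The design point is that the UDF logic lives with the catalog rather than in the client, so the non-Python client only ever sends a SQL string. A session-scoped function registered with `spark.udf.register` would not give you that, since it is only visible inside the session that registered it.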

Your 80/20 framing holds. Most Spark workloads are SQL-expressible and Spark Connect handles that well across every client ecosystem. The UDF gap is real for the remaining slice, and how it gets solved for non-Python clients is still evolving — some of the Go and Rust client projects are early-stage and UDF support varies. Worth watching.

Cheers, Louis
