Re: Apache "Spark Connect"

DB1To3 · ‎03-18-2026

@Louis_Frolio
Thanks a lot for these details. I am always a little anxious about the roadmap for cloud-hosted services like Spark Connect. I have seen certain Spark-derived technologies get rug-pulled in the past - especially stuff that is built outside the context of OSS Spark from Apache.

Not to be too hard on Microsoft, but they are NOT the greatest stewards of Spark. They have introduced innovations for customers on top of Spark, and then abandoned them in just a matter of months with little or no warning. It sounds like that will not happen with Spark Connect (or at least we can probably count on a multi-year warning before this particular technology is removed).

However, even Databricks may introduce changes to their roadmap. For example, SparkR was killed, as I understand, and I think the replacement is something called "sparklyr". (I believe that "sparklyr" is based on Spark Connect, come to think of it, and perhaps that alone should give us confidence that this technology will have legs! ... I don't think the R customers would appreciate another rug-pull in the near future, right after the last one.)

Can you help me understand a jobs cluster scenario? Suppose I have an ephemeral job running in a jobs cluster (say a driver that sleeps and times out in ten minutes). Meanwhile, I connect to that cluster ID from a remote spark-connect client app and transmit 50% of the steps that are needed before the ten minutes run out. In that case, will the job cluster shut down on me? Or will it be accommodating and patiently wait for remotely connected clients to finish? Would it kill a spark-connect plan halfway through its execution, or would it at least wait for executing plans to complete?
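To make the scenario concrete, here is a rough sketch of how I picture the remote client attaching to the cluster. It only assembles a Spark Connect connection string in the `sc://host:port/;key=value` form. The helper name `build_remote_url`, the host, the token, and the cluster ID are all placeholders I made up for illustration, and the `x-databricks-cluster-id` parameter is my understanding of the Databricks convention, so please correct me if that is off.

```python
def build_remote_url(host: str, token: str, cluster_id: str, port: int = 443) -> str:
    """Assemble a Spark Connect connection string (hypothetical helper).

    Shape: sc://<host>:<port>/;<key>=<value>;<key>=<value>
    """
    params = {
        "token": token,
        # Assumption: Databricks routes the session to a cluster via this parameter.
        "x-databricks-cluster-id": cluster_id,
    }
    param_str = ";".join(f"{key}={value}" for key, value in params.items())
    return f"sc://{host}:{port}/;{param_str}"


# Placeholder values only; none of these identify a real workspace or cluster.
url = build_remote_url(
    host="example.cloud.databricks.com",
    token="dapi-placeholder",
    cluster_id="0000-000000-abcdefgh",
)
print(url)

# With pyspark[connect] installed, the client side would then look roughly like:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.remote(url).getOrCreate()
#   spark.range(10).count()  # one of the "steps" sent to the job cluster
```

My question is what happens to a session opened this way once the job's own driver finishes while plans are still in flight.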

As for using gRPC/protobuf for a UDF implementation (one that takes the place of Python), I'm assuming that this is not very common or you would have referenced it in some way. The reason I ask is that in the past there were other UDF interfaces for languages like R and .NET, but unfortunately Databricks didn't see enough value in those and didn't keep them around. Without native UDF interfaces for those languages, I'm hoping a gRPC/protobuf flavor of UDFs could serve a similar purpose.

To ask the question another way, do you know how "sparklyr" sends UDFs to the cluster? Do those users have to rely on UDFs written in Python?

For companies that don't have a massive number of JVM or python developers, I think Spark-Connect is pretty compelling. The thing that attracts me to it is the number of ecosystems that can be supported. Nobody has to be left out in the cold anymore:

sparklyr...
Spark Connect Python
Spark Connect Go
Spark Connect Rust
Spark Connect Swift
Spark Connect .NET
etc.

see:
https://spark.apache.org/spark-connect/#:~:text=Spark%20Connect%20lets%20you%20build%20Spark%20Conne...

Hopefully all those ecosystems will have some approach for UDFs as well; otherwise they lose ~20% of the benefits of Spark. (They would have to come up with funky alternatives to UDFs, which would probably run outside of the cluster.) Admittedly, most clients are sending 80% of the work to the cluster in the form of Spark SQL statements, but that doesn't necessarily get folks all the way to the finish line.