a week ago
Can someone confirm whether this is the right message board for discussing the open-source Apache core of "Spark Connect" (aka Databricks Connect)?
We are hosting workloads on Azure Databricks, but would like to ensure that these workloads follow patterns and practices that are compatible with the open-source flavor of this technology ("Spark Connect").
It is an exciting tool, and probably one that can be used in general-purpose application development (e.g. whenever local compute options are limited). In those cases, it would be nice to hand off to a remote service with effectively limitless compute for processing data.
a week ago
@DB1To3 , we don't have a dedicated Spark channel. With that said, many Spark questions do land here in the Data Engineering discussion group. So, yes, you are in the right place.
If you can provide more details (bulleted list perhaps?) I am happy to assist.
Cheers, Lou.
Wednesday
Hi @Louis_Frolio
Thanks for the reply. I've used Spark clusters in a lot of places (on-prem, HDI, Azure Synapse, Fabric, and now Databricks), so I'm pretty familiar with Spark itself. Some of these hosted platforms don't really promote the use of Spark Connect yet, which is why my experience with it is pretty limited. Additionally, my Spark experience is primarily with v3.3 and earlier (before the technology was introduced into the Apache community).
This will be the very first time I'm digging into the behavior of Spark Connect / Databricks Connect.
Fair warning: I've only made it through a very limited amount of documentation. I'm trying to build a mental framework for understanding any docs I may find about Databricks Connect. Here are some questions to start with; I will post other questions separately going forward.
- It looks like this was introduced in Spark 3.4, with substantially more investment in 3.5.x. I'm assuming that for the best experience we should be using DBR runtimes based on Spark 3.5.x or beyond?
- I realize this technology is frequently used for development (i.e. running notebooks in a local IDE rather than a web browser, while still having full access to the cluster). However, can you confirm that it is also fully supported for production scenarios?
- Can we use databricks-connect with so-called "jobs clusters", in addition to the longer-running "interactive clusters"? At a high level, what is the mechanism to launch a jobs cluster and keep it running for the sake of databricks-connect clients?
Conceptually speaking, the jobs-cluster API doesn't seem very compatible with Spark Connect, because you have to tell it up front what work to do, whereas a remote Spark Connect client would require a Spark cluster/server to be running continuously (I believe).
- Can we use databricks-connect against custom clusters that are initialized with init scripts?
- Does this technology allow libraries to be submitted from a job and subsequently executed in UDFs on the cluster? Is that possible for both Scala/Java and Python?
- From a databricks-connect client, can we upload other files to the cluster as well (native libraries or data files)?
- I believe gRPC is the communication mechanism used to submit work from a client application. Building on that same interfacing technology, are you aware of any common/standard pattern where a UDF is defined as a gRPC/protobuf request to a remote endpoint (without passing through any Python runtime)? Obviously Spark core has an API layer for interacting with Python UDFs, but I'm wondering if there is a smooth way for UDFs to execute gRPC/protobuf calls without relying on an intermediate Python layer on the executor.
Thanks in advance; the answers will help me build a better mental framework for this technology. I know that one of the benefits of databricks-connect is much more flexible interop with other ecosystems (outside of the JVM and Python, which are probably still recommended but possibly no longer required). I'm pretty certain there will be limitations and compromises when using databricks-connect instead of conventional drivers and executors. I'm trying to understand how customers will need to adapt if we want to start adopting databricks-connect for our solutions.
Wednesday
Hey @DB1To3 , thanks for the context. In case you did not know, I am an instructor here at Databricks (I teach all our training), so I am going to respond in kind. I did some digging internally and put together a high-level overview for you. It is not meant to be exhaustive, but I feel it will give you a solid foundation on which to build. So, here we go ...
Let’s build a clean mental model first, because once that clicks, everything else falls into place.
High-level mental model
If you’re on Databricks Runtime 13.3 LTS or newer, Databricks Connect is essentially a thin client sitting on top of Spark Connect. It lets you run Spark code from your own process — your IDE, an app, a service — while the actual execution happens on Databricks compute.
Here’s the way to think about it:
Your non-Spark code (application logic, orchestration, UI, etc.) runs locally in your process
Your Spark DataFrame operations execute remotely on Databricks
Your UDFs get serialized, shipped over, and executed on the cluster
Under the hood, your client sends unresolved logical plans over gRPC (with Arrow in the mix for data transfer) to a Spark Connect server on the cluster. The cluster executes, then streams results back.
If you keep that separation in your head — local control, remote execution — most of the documentation starts to read a lot more naturally.
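To make that split concrete, here is a rough Python sketch. The `sc://` URI shape follows the documented Spark Connect connection-string scheme; the host, token, and cluster ID below are placeholders, and the actual session calls are shown only in comments because they need a live cluster.

```python
# Sketch of the "local control, remote execution" model, assuming the
# standard Spark Connect URI scheme: sc://host:port/;key=value;key=value

def build_connect_uri(host: str, token: str, cluster_id: str, port: int = 443) -> str:
    """Assemble a Spark Connect connection string for a Databricks workspace.
    The parameter names (token, x-databricks-cluster-id) follow the documented
    Databricks Connect format; treat this helper as illustrative."""
    return f"sc://{host}:{port}/;token={token};x-databricks-cluster-id={cluster_id}"

# With databricks-connect installed, usage would look roughly like:
#   from databricks.connect import DatabricksSession
#   spark = DatabricksSession.builder.remote(
#       build_connect_uri("adb-1234.5.azuredatabricks.net", "<pat>", "0101-123456-abcdef")
#   ).getOrCreate()
#   spark.range(10).count()   # the plan travels over gRPC; execution is on the cluster
```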
Versions / runtimes to target
You’re thinking about this the right way with Spark versions, but on Databricks it’s slightly more nuanced.
Spark Connect shows up in Apache Spark 3.4 and gets much more capable in 3.5 (ML support, better UDF coverage, etc.)
Databricks Connect v2 (the Spark Connect–based version) is supported starting with Databricks Runtime 13.3 LTS
In practice, you’ll want to target the latest LTS runtime that supports it, along with the matching client version
For example, today that typically means something like 16.4 LTS (Spark 3.5.2 under the hood) or newer (I think we are at DBR 18).
If you’re using serverless, the environment tends to stay even more current and already includes a compatible Databricks Connect client (e.g., 17.x with pyspark 4.0-based builds).
So the practical takeaway is: don’t anchor on “Spark 3.5.x” alone — think “DBR 13.3+ and ideally the latest LTS or serverless runtime.”
“Dev only” vs. production usage
This is a common misconception — Databricks Connect is not just a dev tool.
It’s a GA client library designed for building real applications that talk to Databricks. Conceptually, it’s closer to a JDBC/ODBC driver — but instead of SQL, you’re sending full Spark plans.
Typical production patterns look like:
Long-running services embedding databricks-connect to execute Spark workloads remotely
External orchestrators running in containers that use Databricks Connect to talk to clusters or serverless compute
That said, production does introduce a bit more responsibility:
You now have two failure domains (your app + Databricks) with a network in between
You need retry and reconnection logic
Sessions aren’t guaranteed to live forever (especially on serverless), so you should design for recreation
But from a platform perspective, this is a first-class, supported production path — not an afterthought.
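As a rough illustration of the retry/recreation point, here is a minimal, framework-agnostic sketch in plain Python. `make_session` stands in for whatever factory you use (e.g. a wrapper around the Databricks Connect session builder); this is a pattern sketch, not official client code.

```python
# Minimal retry-and-recreate pattern for remote Spark Connect sessions.
# Any exception from `action` is treated as a possibly-dead session.

import time

def run_with_session_retry(make_session, action, retries=3, backoff_s=1.0):
    """Run action(session); on failure, back off, rebuild the session, retry."""
    session = make_session()
    for attempt in range(retries):
        try:
            return action(session)
        except Exception:
            if attempt == retries - 1:
                raise                                 # out of retries, surface the error
            time.sleep(backoff_s * (2 ** attempt))    # exponential backoff
            session = make_session()                  # recreate, never reuse a dropped session
```

In a real service, you would narrow the `except` clause to the connection/session errors your client library actually raises, rather than catching everything.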
Jobs clusters vs. interactive clusters
From a Databricks Connect perspective, there are really just two targets:
Classic compute (all-purpose and jobs clusters)
Serverless compute
Yes, you can connect to a jobs cluster — with a couple of conditions:
The Spark Connect service must be enabled
The cluster must actually be running when you connect
There’s no special lifecycle abstraction here. You still use the standard Jobs or Compute APIs to spin clusters up and down. Databricks Connect just attaches to whatever is available via cluster ID.
Where things get interesting is the mismatch in behavior:
Jobs clusters are ephemeral — they spin up, run, and terminate
Spark Connect assumes a more session-oriented, long-lived interaction model
So while it works, it’s not always a natural fit.
In practice, most teams land on:
All-purpose clusters for interactive or iterative work
Serverless for “always available when needed” compute without lifecycle management
Your intuition is spot on here — the “define the whole job upfront” model doesn’t map cleanly to a long-lived remote session.
Custom clusters with init scripts
This part is refreshingly straightforward.
Yes — Databricks Connect works just fine with custom clusters, including ones that use init scripts.
The only real guardrails are:
Don’t disable the Spark Connect service (for example, via spark.databricks.service.server.enabled=false)
Be careful with low-level configs that could interfere with RPC or serialization
Other than that, there’s nothing special about a Spark-Connect-enabled cluster. It’s just a normal Databricks cluster with a gRPC service exposed.
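For illustration only, a Connect-friendly cluster spec might look like the following. Names, the init-script path, and the node type are placeholders; the `init_scripts` shape follows the Clusters API, and the one guardrail is simply not flipping the service flag off.

```json
{
  "cluster_name": "connect-enabled-cluster",
  "spark_version": "16.4.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "init_scripts": [
    { "workspace": { "destination": "/Shared/init/install-deps.sh" } }
  ],
  "spark_conf": {
    "spark.databricks.service.server.enabled": "true"
  }
}
```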
Libraries and UDFs (Python focus)
For Python, the model is clean:
UDFs you define locally get serialized and shipped to the cluster
They execute on Databricks compute, not in your local process
There’s also explicit support for dependency management:
You can declare Python dependencies that need to exist on the Databricks side for your UDFs
Those dependencies are installed into the execution environment so your UDFs behave as expected
So again, same pattern: define locally, execute remotely.
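A tiny sketch of that pattern follows. The registration lines are commented out since they need a live pyspark session; the column name and UDF logic are made up for illustration.

```python
# Define-locally / execute-remotely: the function body below is plain Python.
# Registering it as a UDF (commented out) would serialize it and ship it to
# the cluster, where it runs inside the Python workers.

def normalize_plate(plate: str) -> str:
    """Example UDF logic: uppercase and strip separators from a license plate."""
    return plate.upper().replace("-", "").replace(" ", "")

# With a Databricks Connect session, registration would look roughly like:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   normalize_udf = udf(normalize_plate, StringType())
#   df = df.withColumn("plate_norm", normalize_udf("plate"))
# Any packages the function body imports must also exist on the cluster side.
```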
Takeaway
If you zoom out, everything here reduces to one idea:
You’re developing locally, but executing remotely — and Databricks Connect is the bridge that makes that feel seamless.
Once you internalize that split, the rest — versions, clusters, UDF behavior, even production design — all follow pretty naturally.
Hopefully, this gives you a little insight and a good launching point to dig in deeper.
Cheers, Lou.
Wednesday
@Louis_Frolio
Thanks a lot for these details. I am always a little anxious about the roadmap for cloud-hosted services like Spark Connect. I have seen certain Spark-derived technologies get rug-pulled in the past - especially stuff that is built outside the context of OSS Spark from Apache.
Not to be too hard on Microsoft, but they are NOT the greatest stewards of Spark. They have introduced innovations for customers on top of Spark, and then abandoned them in just a matter of months with little or no warning. It sounds like that will not happen with Spark Connect (or at least we can probably count on a multi-year warning before this particular technology is removed).
However even Databricks may introduce changes to their roadmap. For example SparkR was killed, as I understand, and I think the replacement is something called "sparklyr". (I believe that "sparklyr" is based on Spark Connect, come to think of it, and perhaps that alone should give us confidence that this technology will have legs! ... I don't think the R customers would appreciate another rug-pull in the near future, right after the last one.)
Can you help me understand a jobs cluster scenario? Say I have an ephemeral job running in a jobs cluster (a driver that sleeps and times out in ten minutes). Meanwhile, I connect to that cluster ID from a remote Spark Connect client app and transmit 50% of the steps that are needed before the ten minutes run out. In that case, will the jobs cluster shut down on me? Or will it be accommodating and patiently wait for remotely connected clients to finish? Would it kill a Spark Connect plan halfway through execution, or would it at least wait for executing plans to complete?
As for using gRPC/protobuf for a UDF implementation (one that substitutes in place of Python), I'm assuming this is not very common or you would have referenced it in some way. The reason I ask is that in the past there used to be other UDF interfaces for languages like R and .NET, but unfortunately Databricks didn't see enough value in those and didn't keep them around. Without native UDF interfaces for those languages, I'm hoping a gRPC/protobuf flavor of UDFs could potentially serve a similar purpose.
To ask the question another way: do you know how "sparklyr" sends UDFs to the cluster? Do those users have to rely on UDFs written in Python?
For companies that don't have a massive number of JVM or Python developers, I think Spark Connect is pretty compelling. The thing that attracts me to it is the number of ecosystems that can be supported. Nobody has to be left out in the cold anymore:
see:
https://spark.apache.org/spark-connect/#:~:text=Spark%20Connect%20lets%20you%20build%20Spark%20Conne...
Hopefully all those ecosystems will have some approach for UDFs as well, or they lose ~20% of the benefits of Spark (they would have to come up with awkward alternatives to UDFs, which would probably run outside the cluster). Admittedly, most clients send 80% of the work to the cluster in the form of Spark SQL statements, but that doesn't necessarily get folks all the way to the finish line.
Friday
@DB1To3 ,
Great questions — let me take them one at a time.
On the jobs cluster + Spark Connect timing scenario: yes, the cluster will shut down on you. A jobs cluster is lifecycle-bound to the job run that launched it — the scheduler creates it, runs the job, and terminates it when the job is done. There is no mechanism that detects an external Spark Connect session mid-execution and waits for it to finish. Your remote client gets dropped, and any in-flight plan execution dies with it.
Worth flagging: Databricks Connect is designed to target all-purpose clusters, not jobs clusters. If your use case involves remote clients running iterative or long-running Spark work, an all-purpose cluster is the right fit. You configure auto-termination based on inactivity, and active Spark Connect sessions count as activity. If you want fully managed lifecycle with no cluster provisioning overhead, serverless compute is increasingly where Databricks is pointing people for this pattern.
On SparkR vs. sparklyr — these are two distinct things, and it's worth separating them. SparkR is the official Apache Spark R package, maintained by the Apache project. Databricks deprecated their support for it in their runtimes. sparklyr is a separate R package from Posit (formerly RStudio) — it predates the SparkR deprecation and has long been the community-preferred way to work with Spark from R, mostly because of its dplyr-style interface. It's not Databricks' official replacement for SparkR; it's just what most R practitioners were already using. sparklyr added support for Databricks Connect v2 (Spark Connect-based) through a companion package called pysparklyr.
On R UDFs in sparklyr over Spark Connect: this is where it gets interesting. Spark Connect natively only supports Python UDFs — there is no native R UDF pathway over the wire. sparklyr works around this using rpy2, a Python library that embeds and executes R code. The flow: your R code in spark_apply() goes to rpy2 on the client side, which transports it as a Python UDF, and rpy2 on the cluster runs it. That means rpy2 needs to be installed on both ends — local Python environment and the Databricks cluster. It works, and it supports Arrow for efficient data exchange, but it's a bridge, not a native interface. Any R packages your UDF depends on also need to be pre-installed on the cluster, which adds operational overhead.
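To make the bridge mechanism concrete, here is a rough Python sketch of the idea. This is an illustration of the mechanism, not sparklyr's actual implementation, and it assumes rpy2 (and R itself) are present wherever the UDF body actually runs.

```python
# Conceptual sketch of the sparklyr/rpy2 bridge: R code travels inside a
# Python UDF, and rpy2 on the cluster evaluates it in an embedded R session.

def wrap_r_in_python_udf(r_code: str):
    """Return a Python callable that evaluates `r_code` via rpy2.
    rpy2 is imported lazily because it only needs to exist where the
    UDF body executes (the cluster's Python workers)."""
    def udf_body(*args):
        import rpy2.robjects as ro     # runs on the cluster, not the client
        return ro.r(r_code)            # hand the R snippet to the embedded R engine
    return udf_body

# Conceptually, sparklyr's spark_apply(sdf, function(df) { ... }) becomes a
# Python UDF like the one above, round-tripping each Arrow batch through R.
```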
On the gRPC/protobuf UDF angle: you read the silence correctly. Using the Spark Connect plugin API to extend the protobuf schema and carry custom UDF logic is technically possible, but that's framework-extension territory — not a user-facing interface and not something most application developers would wire up themselves. The practical path for non-Python/JVM clients that need UDF behavior: define the UDF in Python or Scala on the cluster side and call it from your Spark Connect client. Not perfect, but supported and maintainable.
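Here is a small sketch of that practical path from the thin-client side. The UDF name and table are hypothetical; the point is that any Spark Connect client that can send SQL (Go, Rust, .NET via a SQL connector, etc.) can invoke a UDF that was registered cluster-side.

```python
# Cluster-side (e.g. in a notebook), someone would have registered the UDF:
#   spark.udf.register("normalize_plate", lambda p: p.upper().replace("-", ""))
# After that, any client only needs to send SQL that calls it by name.

def call_registered_udf_sql(udf_name: str, column: str, table: str) -> str:
    """Build the SQL statement a thin Spark Connect client would send."""
    return f"SELECT {udf_name}({column}) AS result FROM {table}"

# From a Python client this would be:
#   spark.sql(call_registered_udf_sql("normalize_plate", "plate", "vehicles"))
```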
Your 80/20 framing holds. Most Spark workloads are SQL-expressible and Spark Connect handles that well across every client ecosystem. The UDF gap is real for the remaining slice, and how it gets solved for non-Python clients is still evolving — some of the Go and Rust client projects are early-stage and UDF support varies. Worth watching.
Cheers, Louis
Sunday - last edited Sunday
>> there is no native R UDF pathway over the wire. sparklyr works around this using rpy2, a Python library that embeds and executes R code
This is interesting. I would not have thought of Python as the best runtime for bridging. I'm wondering if this involves yet another out-of-process hop, together with more serialization and deserialization of the Arrow dataframes. I'm also wondering why it wouldn't be possible to simply bypass Python and launch the R UDFs directly from Spark core (the JVM). It seems like there are a lot of hops to execute R logic on the executors.
I will try to find this community of sparklyr users to learn more. I'm guessing these folks can be found on the GitHub project and on r/stats.
I had hoped that sparklyr would be an official Databricks replacement for SparkR. I'm guessing that users in this community would have a hard time getting official support if things should ever break after a new release of the Databricks Runtime. I suppose the users rely on one another in the community for support.
I hope that Databricks will have some guidance for running non-Python UDFs in the future. Most of our internal libraries are built on the .NET runtime (.NET Core), and Databricks seems to be deliberately neglecting that ecosystem for some reason. I think the C#/.NET community is still growing at a VERY fast rate and may even overtake Java itself in a couple of years. Despite this large community of potential Spark customers, I get the sense that Databricks has no interest in removing the barriers to entry for all of these folks. I certainly agree that Python integrations are always popular (because the language/runtime/tools are free and widely accessible). However, it seems quite strange to me that Databricks would start making accommodations for R/Go/Rust while doing almost nothing for C#/.NET developers. I'd guess there has to be some unfortunate politics behind this strategy. And it goes beyond Databricks: despite the fact that C# is an open-source platform nowadays, open-source communities like Apache still don't seem to want to accept it.