Re: Apache "Spark Connect"

DB1To3 · 03-18-2026

Hi @Louis_Frolio
Thanks for the reply. I've used spark clusters in a lot of places (on-prem, HDI, Azure Synapse, Fabric, and now Databricks), so I'm pretty familiar with spark itself. Some of these hosted platforms don't really promote the use of spark-connect yet, which is why my experience with it is pretty limited. Additionally, my spark experience is primarily with v3.3 and earlier (i.e. before spark-connect was introduced into Apache Spark).

This will be the very first time I'm digging into the behavior of spark-connect/databricks-connect.

Fair warning - I've only made it through a very limited amount of documentation. I'm trying to build a mental framework to understand any docs I may find about databricks-connect. Here are some questions to start with. I will post other questions separately going forward.

- It looks like spark-connect was introduced in Spark 3.4 and substantially more investment was made in 3.5.x. I'm assuming that for the best experience we should be using a DBR version based on Spark 3.5.x or later?

- I realize this technology is frequently used for the sake of development (i.e. running notebooks in a local IDE rather than a web browser, while still having full exposure to the cluster). However, can you confirm that it is also fully supported for production scenarios?

- Can we use databricks-connect with so-called "jobs clusters", in addition to the longer-running "interactive clusters"? At a high level, what is the mechanism to launch a jobs cluster and keep it running for the sake of databricks-connect clients?
Conceptually speaking, a jobs-cluster API doesn't seem very compatible with spark-connect, because you have to tell it up front what work to do, whereas a remote spark-connect client would (I believe) require a spark cluster/server to be running continuously.

- Can we use databricks-connect against custom clusters that are initialized with init scripts?

- Does this technology allow libraries to be submitted from the client along with a job, and subsequently executed in UDFs on the cluster? Is that possible for both scala/java and python?

- From a databricks-connect client, can we upload other files to the cluster as well (native libraries or data files)?

- I think gRPC is the communication mechanism used to submit work from a client application. Building on that same interfacing technology, are you aware of any common/standard pattern where a UDF is defined as a gRPC/protobuf request to a remote endpoint (without passing through any python runtime)? Obviously spark-core has an API layer to interact with python UDFs, but I'm wondering if there is a smooth way for UDFs to execute gRPC/protobuf calls without relying on an intermediate python layer on the executor.
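
To make the client/server model behind these questions concrete, here is a minimal sketch of how a plain open-source Spark Connect client (PySpark 3.4+) attaches to an already-running server over gRPC. It is not databricks-connect itself; the helper names `build_remote_url` and `connect`, and the host/token values, are hypothetical and only for illustration. The `sc://host:port/;key=value` connection string and `SparkSession.builder.remote(...)` come from the open-source Spark Connect client.

```python
from typing import Optional

# Default port a Spark Connect server listens on (open-source Spark).
DEFAULT_PORT = 15002


def build_remote_url(host: str,
                     port: int = DEFAULT_PORT,
                     token: Optional[str] = None,
                     use_ssl: bool = False) -> str:
    """Build a Spark Connect connection string (hypothetical helper).

    Format: sc://host:port/;param=value;param=value
    Values are inserted as-is, so this sketch assumes the token contains
    no characters that need escaping.
    """
    params = []
    if token is not None:
        params.append(f"token={token}")
    if use_ssl:
        params.append("use_ssl=true")
    url = f"sc://{host}:{port}/"
    if params:
        url += ";" + ";".join(params)
    return url


def connect(url: str):
    """Open a remote session (hypothetical helper).

    Requires the pyspark package with Spark Connect support and a server
    that is already running and reachable at `url`.
    """
    from pyspark.sql import SparkSession  # imported lazily: needs pyspark >= 3.4
    return SparkSession.builder.remote(url).getOrCreate()


if __name__ == "__main__":
    # Only builds the URL; call connect(url) when a server is available.
    print(build_remote_url("localhost"))
```

The point of the sketch is that the client only holds a URL to a long-lived server endpoint, which is why the jobs-cluster question above matters: something has to keep that endpoint alive while clients come and go.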

Thanks in advance; the answers will help me build a better mental framework for this technology. I know that one of the benefits of databricks-connect is much more flexible interop with other ecosystems (outside of JVM and python - which are probably still recommended - but possibly no longer required). I'm pretty certain that there will be limitations and compromises when using databricks-connect instead of conventional drivers and executors. I'm trying to understand how customers will need to adapt if we would like to start adopting databricks-connect for our solutions.