Re: Apache "Spark Connect"

Louis_Frolio · 03-18-2026

Hey @DB1To3, thanks for the context. In case you didn't know, I'm an instructor here at Databricks and I teach all our training, so I'm going to respond in that spirit. I did some digging internally and put together a high-level overview to help you learn a bit more. It isn't meant to be exhaustive, but I think it will give you a solid foundation to build on. So, here we go.

Let’s build a clean mental model first, because once that clicks, everything else falls into place.

High-level mental model

If you’re on Databricks Runtime 13.3 LTS or newer, Databricks Connect is essentially a thin client sitting on top of Spark Connect. It lets you run Spark code from your own process — your IDE, an app, a service — while the actual execution happens on Databricks compute.

Here’s the way to think about it:

Your non-Spark code (application logic, orchestration, UI, etc.) runs locally in your process
Your Spark DataFrame operations execute remotely on Databricks
Your UDFs get serialized, shipped over, and executed on the cluster

Under the hood, your client sends unresolved logical plans over gRPC (with Arrow in the mix for data transfer) to a Spark Connect server on the cluster. The cluster executes, then streams results back.

If you keep that separation in your head — local control, remote execution — most of the documentation starts to read a lot more naturally.
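To make that split concrete, here is a minimal sketch. It assumes the databricks-connect package is installed and that authentication is already configured through your default Databricks settings; the function names (double_locally, run_remote_example, connect) are my own illustrations, not part of any API.

```python
from typing import Any


def double_locally(x: int) -> int:
    """Plain Python. Wrapped as a UDF, it is serialized and run on the cluster."""
    return x * 2


def run_remote_example(spark: Any) -> list:
    """Build a plan on the client, execute it remotely, collect the rows back."""
    # Lazy imports so this module loads even where pyspark is not installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import LongType

    double_udf = udf(double_locally, LongType())
    # These calls only describe the work; nothing runs yet.
    df = spark.range(5).withColumn("doubled", double_udf(col("id")))
    # collect() sends the plan to the cluster and brings the results back.
    return [row["doubled"] for row in df.collect()]


def connect() -> Any:
    """Create a session from the default Databricks configuration (assumed)."""
    from databricks.connect import DatabricksSession

    return DatabricksSession.builder.getOrCreate()
```

Calling run_remote_example(connect()) from your laptop would run the doubling on Databricks compute, while the list you get back lives in your local process.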

Versions / runtimes to target

You’re thinking about this the right way with Spark versions, but on Databricks it’s slightly more nuanced.

Spark Connect shows up in Apache Spark 3.4 and gets much more capable in 3.5 (ML support, better UDF coverage, etc.)
Databricks Connect v2 (the Spark Connect–based version) is supported starting with Databricks Runtime 13.3 LTS
In practice, you’ll want to target the latest LTS runtime that supports it, along with the matching client version

For example, today that typically means something like DBR 16.4 LTS (Spark 3.5.2 under the hood) or newer (I believe the latest runtime is DBR 18).

If you’re using serverless, the environment tends to stay even more current and already includes a compatible Databricks Connect client (e.g., 17.x with pyspark 4.0-based builds).

So the practical takeaway is: don’t anchor on “Spark 3.5.x” alone — think “DBR 13.3+ and ideally the latest LTS or serverless runtime.”
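If you want to enforce that rule in code, a small check like the one below could work. It is only a sketch: it assumes the runtime is given as a version key in the usual "16.4.x-scala2.12" shape, and both function names are hypothetical.

```python
import re

# First runtime this post names as supporting the Spark Connect based client.
MIN_DBR = (13, 3)


def parse_dbr_version(spark_version: str) -> tuple:
    """Extract (major, minor) from a string such as '16.4.x-scala2.12'."""
    match = re.match(r"^(\d+)\.(\d+)", spark_version)
    if match is None:
        raise ValueError(f"unrecognized runtime version: {spark_version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_spark_connect_client(spark_version: str) -> bool:
    """True when the runtime is DBR 13.3 or newer."""
    return parse_dbr_version(spark_version) >= MIN_DBR
```

The same major.minor pair is what you would normally use to pick the matching databricks-connect client version.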

“Dev only” vs. production usage

This is a common misconception — Databricks Connect is not just a dev tool.

It’s a generally available (GA) client library designed for building real applications that talk to Databricks. Conceptually, it’s closer to a JDBC/ODBC driver, except that instead of SQL you’re sending full Spark plans.

Typical production patterns look like:

Long-running services embedding databricks-connect to execute Spark workloads remotely
External orchestrators running in containers that use Databricks Connect to talk to clusters or serverless compute

That said, production does introduce a bit more responsibility:

You now have two failure domains (your app + Databricks) with a network in between
You need retry and reconnection logic
Sessions aren’t guaranteed to live forever (especially on serverless), so you should design for recreation

But from a platform perspective, this is a first-class, supported production path — not an afterthought.
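Here is one way the retry and recreation advice could look in practice. Treat it as a sketch under assumptions: make_session is a hypothetical factory you supply (how you force a genuinely fresh session depends on your client version), and catching bare Exception is only for brevity.

```python
import time
from typing import Any, Callable


def run_with_reconnect(
    make_session: Callable[[], Any],
    work: Callable[[Any], Any],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> Any:
    """Run work(session), building a new session on every attempt."""
    last_error = None
    for attempt in range(attempts):
        try:
            # Never reuse a session across attempts; it may have expired.
            session = make_session()
            return work(session)
        except Exception as exc:  # narrow this to connection errors in real code
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2x, 4x, ...
                time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

The design choice worth noticing is that the session is created inside the loop, so a dead session never survives into the next attempt.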

Jobs clusters vs. interactive clusters

From a Databricks Connect perspective, there are really just two targets:

Classic compute (all-purpose and jobs clusters)
Serverless compute

Yes, you can connect to a jobs cluster — with a couple of conditions:

The Spark Connect service must be enabled
The cluster must actually be running when you connect

There’s no special lifecycle abstraction here. You still use the standard Jobs or Compute APIs to spin clusters up and down. Databricks Connect just attaches to whatever is available via cluster ID.
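As a rough illustration of "attach to whatever is running", the sketch below checks the cluster state before opening a session. It assumes the databricks-sdk and databricks-connect packages with default authentication; attach_when_running, sdk_state, and connect_session are names I made up for this example.

```python
from typing import Any, Callable


def attach_when_running(
    cluster_id: str,
    get_state: Callable[[str], str],
    open_session: Callable[[str], Any],
) -> Any:
    """Open a session only if the cluster reports RUNNING."""
    state = get_state(cluster_id)
    if state != "RUNNING":
        raise RuntimeError(f"cluster {cluster_id} is {state}; start it first")
    return open_session(cluster_id)


def sdk_state(cluster_id: str) -> str:
    """Look up the cluster state with the Databricks SDK (lazy import)."""
    from databricks.sdk import WorkspaceClient

    details = WorkspaceClient().clusters.get(cluster_id=cluster_id)
    return details.state.value if details.state else "UNKNOWN"


def connect_session(cluster_id: str) -> Any:
    """Attach Databricks Connect to a cluster by ID (lazy import)."""
    from databricks.connect import DatabricksSession

    return DatabricksSession.builder.clusterId(cluster_id).getOrCreate()
```

You would wire it together as attach_when_running(cluster_id, sdk_state, connect_session). Starting or stopping the cluster stays with the Compute or Jobs APIs, as described above.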

Where things get interesting is the mismatch in behavior:

Jobs clusters are ephemeral — they spin up, run, and terminate
Spark Connect assumes a more session-oriented, long-lived interaction model

So while it works, it’s not always a natural fit.

In practice, most teams land on:

All-purpose clusters for interactive or iterative work
Serverless for “always available when needed” compute without lifecycle management

Your intuition is spot on here — the “define the whole job upfront” model doesn’t map cleanly to a long-lived remote session.
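If your application needs to handle both targets, the choice can be a one-liner. This sketch assumes a reasonably recent databricks-connect client that offers the serverless builder option; choose_target and open_session are hypothetical helper names.

```python
from typing import Any, Optional


def choose_target(cluster_id: Optional[str]) -> str:
    """Fall back to serverless when no cluster ID is configured."""
    return f"cluster:{cluster_id}" if cluster_id else "serverless"


def open_session(cluster_id: Optional[str] = None) -> Any:
    """Open a session against classic compute or serverless (lazy import)."""
    from databricks.connect import DatabricksSession

    builder = DatabricksSession.builder
    if choose_target(cluster_id) == "serverless":
        # No cluster lifecycle to manage on this path.
        return builder.serverless(True).getOrCreate()
    # Classic compute: the cluster must already be running.
    return builder.clusterId(cluster_id).getOrCreate()
```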

Custom clusters with init scripts

This part is refreshingly straightforward.

Yes — Databricks Connect works just fine with custom clusters, including ones that use init scripts.

The only real guardrails are:

Don’t disable the Spark Connect service (for example, via spark.databricks.service.server.enabled=false)
Be careful with low-level configs that could interfere with RPC or serialization

Other than that, there’s nothing special about a Spark-Connect-enabled cluster. It’s just a normal Databricks cluster with a gRPC service exposed.
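If you manage cluster definitions as code, you could lint them for the first guardrail. This is just a sketch: it assumes the Spark configuration is available as a plain dict of strings, and it checks only the single flag mentioned above.

```python
# Config key named in the guardrail above.
SERVICE_FLAG = "spark.databricks.service.server.enabled"


def disables_spark_connect(spark_conf: dict) -> bool:
    """True when the cluster's Spark conf explicitly turns the service off."""
    value = str(spark_conf.get(SERVICE_FLAG, "")).strip().lower()
    return value == "false"
```

A missing key is treated as "not disabled", since the concern here is only an explicit opt-out.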

Libraries and UDFs (Python focus)

For Python, the model is clean:

UDFs you define locally get serialized and shipped to the cluster
They execute on Databricks compute, not in your local process

There’s also explicit support for dependency management:

You can declare Python dependencies that need to exist on the Databricks side for your UDFs
Those dependencies are installed into the execution environment so your UDFs behave as expected

So again, same pattern: define locally, execute remotely.
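For the dependency piece, the sketch below shows the general shape as I understand it. Treat the DatabricksEnv and withEnvironment calls as assumptions to verify against the documentation for your client version, since this feature is only available in newer Databricks Connect releases. The helper names and the example package are purely illustrative.

```python
from typing import Any


def pin_requirement(name: str, version: str) -> str:
    """Build a pip-style pinned requirement, e.g. 'simplejson==3.19.3'."""
    return f"{name}=={version}"


def session_with_dependency(name: str, version: str) -> Any:
    """Open a session whose UDF environment includes one pinned package."""
    # Lazy import; assumes a client version that ships DatabricksEnv.
    from databricks.connect import DatabricksEnv, DatabricksSession

    env = DatabricksEnv().withDependencies(pin_requirement(name, version))
    return DatabricksSession.builder.withEnvironment(env).getOrCreate()
```

Pinning exact versions keeps the package your UDF imports on the cluster in step with the one you tested against locally.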

Takeaway

If you zoom out, everything here reduces to one idea:

You’re developing locally, but executing remotely — and Databricks Connect is the bridge that makes that feel seamless.

Once you internalize that split, the rest — versions, clusters, UDF behavior, even production design — all follow pretty naturally.

Hopefully this gives you a little insight and a great launching point to dig in deeper.

Cheers, Lou.