Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

JDBC with serverless compute

Dedescoat
New Contributor
Hi community,

We have a scenario where we need to ingest data into Lakebase. Currently, we are trying to use JDBC to write data from a notebook on serverless compute. However, the documentation on serverless limitations (link) mentions that JAR libraries are not supported in serverless notebooks, which prevents us from using a JDBC driver for ingestion.
 
 
Recently, a new JDBC connection feature was introduced that allows JDBC drivers to be stored in a Unity Catalog volume and used through a configured JDBC connection. We tested this approach following the documented steps and successfully wrote data into a Lakebase database from serverless compute.
 
 
However, we noticed that write performance is significantly slower compared to classic compute. We are wondering what might be causing this performance degradation. Is it an inherent limitation of serverless compute, or something specific to how JDBC connections are handled in Databricks?
 

Additionally, we would like to confirm whether this is the best approach for writing data into Lakebase. An alternative we are considering is using psycopg2, but that would require manually writing SQL statements.
 
Thanks, we'd appreciate any insights you might have.
1 ACCEPTED SOLUTION


Louis_Frolio
Databricks Employee

Hello @Dedescoat, I did some poking around in our documentation and would like to offer some tips to help you diagnose the issue further. Yes: using a Unity Catalog JDBC connection to load a driver from a UC volume and write from serverless compute is supported, and it's the right pattern when you need Spark-based writes to an external database. Query federation is read-only, so for any write use case JDBC is the correct path.

The performance difference you’re seeing is very unlikely to be a generic “serverless limitation.” In practice, JDBC write throughput is almost always governed by parallelism, batch sizing, database-side limits, and the network path. Spark issues one JDBC connection per partition, so effective write concurrency is entirely driven by how many partitions you’re writing with.

Let’s break it down.

What you already have right

First, the basics you cited are correct.

Serverless notebooks don’t support installing JAR libraries directly, which is why the old “attach a JDBC driver JAR to the cluster” approach doesn’t work there.

The newer JDBC Unity Catalog connection model is designed for exactly this use case. You place the JDBC driver in a UC volume and reference it via .option("databricks.connection", …). That works across UC compute types, including serverless.

Why serverless can look slower in practice

When serverless appears slower, it’s usually due to one (or more) of these factors rather than a hard cap.

Parallelism on writes

Spark determines how many concurrent JDBC inserts to issue based on the number of partitions in the DataFrame at write time. If your serverless job ends up with fewer partitions than your classic cluster — because of different shuffle behavior, autoscaling characteristics, or simply not repartitioning explicitly — you’ll see fewer concurrent connections and lower throughput. The fix is simple and explicit: repartition to the right level for the target database.
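A quick way to act on this is to check the partition count you actually get on each compute type and then set it explicitly. The helper below is a hypothetical sketch, not a Databricks API: it picks a write-partition count that leaves headroom under the database's connection limit, since each write partition opens one JDBC connection.

```python
def choose_write_partitions(db_max_connections: int,
                            reserved: int = 10,
                            cap: int = 64) -> int:
    """Pick a JDBC write-partition count (hypothetical helper, not a
    Databricks API). Each write partition opens one JDBC connection,
    so leave `reserved` connections free for other clients and cap
    the total to avoid overwhelming the database."""
    available = max(db_max_connections - reserved, 1)
    return min(available, cap)

# Example: a Postgres-style default of max_connections=100
n = choose_write_partitions(100)  # -> 64

# In the notebook you would then:
#   df.rdd.getNumPartitions()   # inspect the partition count you have now
#   df.repartition(n)           # force the write concurrency you want
```

Comparing `df.rdd.getNumPartitions()` on classic vs. serverless right before the write will quickly confirm whether parallelism is the culprit.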

Driver sandbox and memory limits

On UC compute, JDBC drivers run in an isolated sandbox for security reasons. That sandbox has a 400 MiB memory cap. Very large rows or overly aggressive batch sizes can increase memory pressure. If you see memory-related failures or instability, dial back batch sizing or row width.

Network path

Serverless compute requires allow-listed egress or private connectivity to reach your database. Depending on how traffic is routed (NAT, PrivateLink, etc.), latency can differ from a classic cluster with direct peering. Databricks provides stable IP ranges for serverless and a simple connectivity test UDF to validate host, port, and latency. This is worth checking, especially if classic and serverless are using different paths.
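If you want a quick comparison without the built-in connectivity test, a plain TCP connect measured from each compute type gives a rough latency number. This is a generic Python sketch, not a Databricks API; the host and port at the bottom are placeholders.

```python
import socket
import time

def tcp_connect_ms(host: str, port: int, timeout: float = 5.0) -> float:
    """Open a TCP connection to (host, port) and return the connect
    latency in milliseconds; raises OSError if unreachable."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        return (time.monotonic() - start) * 1000.0

# Run the same check from classic and from serverless compute, then compare:
# tcp_connect_ms("your-db-host.example.com", 5432)
```

A large gap between the two environments points at routing (NAT vs. peering vs. PrivateLink) rather than at Spark.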

Not a blanket limitation

There’s no documented “serverless JDBC write cap.” The JDBC connection is supported on both serverless and standard compute, and performance is generally explained by the factors above, not by an inherent serverless throttle.

Recommended tuning for Spark JDBC writes

If you want predictable and repeatable performance, be explicit.

Control write parallelism

Pick a partition count that matches what your target database can safely handle. Spark uses one connection per partition for JDBC writes. Repartition before .write to control concurrency, and avoid overwhelming the database with too many simultaneous connections.

Tune the JDBC options that matter

Use a moderate batch size. In practice, something in the 2k–10k range is a good starting point. This reduces round-trips without causing large memory spikes. Tune based on your database’s behavior.

Depending on your semantics and driver support, options like isolation level or truncate can also reduce overhead.

Reduce per-row work on the destination

If you’re doing bulk loads, minimize secondary indexes, triggers, and constraints where possible, then add or re-enable them after the load. This is destination-specific, but it often makes a dramatic difference.
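As a hedged sketch of that drop-then-rebuild pattern (placeholder index and table names, a generic DB-API-style connection; adapt the DDL to your destination):

```python
# Hypothetical bulk-load pattern for a Postgres-compatible target:
# drop a secondary index, run the load, then rebuild the index.
DROP_SQL = "DROP INDEX IF EXISTS idx_orders_customer;"
CREATE_SQL = "CREATE INDEX idx_orders_customer ON orders (customer_id);"

def with_index_dropped(conn, load_fn):
    """Run load_fn with the secondary index removed, then rebuild it
    even if the load fails (so the table is never left unindexed)."""
    with conn.cursor() as cur:
        cur.execute(DROP_SQL)
    conn.commit()
    try:
        load_fn()
    finally:
        with conn.cursor() as cur:
            cur.execute(CREATE_SQL)
        conn.commit()
```

Whether this is worth it depends on load size versus index rebuild cost; measure both on your own data.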

Validate connectivity and limits

Use the connectivity test to confirm latency and routing from the compute environment. Make sure serverless IPs are allow-listed or that private connectivity is configured correctly.

Example pattern using a UC JDBC connection

N = 16  # choose based on DB connection limits and expected concurrency

(df
  .repartition(N)  # controls concurrent JDBC connections
  .write
  .format("jdbc")
  .option("databricks.connection", "<your_jdbc_connection_name>")
  .option("dbtable", "<schema.table_or_destination>")
  .option("batchsize", "5000")  # tune for your DB
  .mode("append")
  .save()
)

At query time, users specify databricks.connection, and only options allow-listed on that connection are permitted. Spark controls write concurrency entirely through the number of in-memory partitions.

What about psycopg2 (or other native drivers)?

It’s a valid option, but it’s a different trade-off. A driver like psycopg2 typically uses a single client connection by default. To match Spark’s throughput, you’d need to implement your own parallelism and batching logic.
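For completeness, here is a minimal sketch of what you would have to build yourself with psycopg2: chunk the rows into batches and insert each batch with `psycopg2.extras.execute_values` (a real psycopg2 helper). Table and column names are placeholders, and the example assumes psycopg2 is installed.

```python
from itertools import islice

def batches(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

def write_rows(conn, rows, batch_size=5000):
    """Batched insert via psycopg2. The import is kept local so the
    chunking helper above works even without psycopg2 installed."""
    from psycopg2.extras import execute_values
    with conn.cursor() as cur:
        for chunk in batches(rows, batch_size):
            execute_values(
                cur,
                "INSERT INTO my_schema.my_table (id, payload) VALUES %s",
                chunk,
            )
    conn.commit()
```

To approach Spark's throughput you would then shard `rows` and run `write_rows` on several connections concurrently (for example with `concurrent.futures.ThreadPoolExecutor`), which is exactly the parallelism Spark gives you for free.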

Spark’s JDBC path gives you governed access via a UC connection and built-in parallel writes across partitions. If your database supports a true bulk loader (for example, COPY), that can be extremely fast, but it usually comes with bespoke logic and less out-of-the-box governance.

When this approach is the right one

Databricks generally recommends the JDBC Unity Catalog connection when you need to write to an external database from Spark, when federation isn’t an option or is read-only, or when you need a specific JDBC driver and fine-grained control over Spark write behavior.

Hope this gives you guidance.

Cheers, Louis
