02-03-2026 08:08 AM
Hello @Dedescoat , I did some poking around in our documentation and would like to offer some tips/tricks to help you further diagnose the issue. Yes — using a Unity Catalog JDBC connection to load a driver from a UC volume and write from serverless compute is supported, and it’s the right pattern when you need Spark-based writes to an external database. Query federation is read-only, so for any write use case JDBC is the correct path.
The performance difference you’re seeing is very unlikely to be a generic “serverless limitation.” In practice, JDBC write throughput is almost always governed by parallelism, batch sizing, database-side limits, and the network path. Spark issues one JDBC connection per partition, so effective write concurrency is entirely driven by how many partitions you’re writing with.
Let’s break it down.
What you already have right
First, the basics you cited are correct.
Serverless notebooks don’t support installing JAR libraries directly, which is why the old “attach a JDBC driver JAR to the cluster” approach doesn’t work there.
The newer JDBC Unity Catalog connection model is designed for exactly this use case. You place the JDBC driver in a UC volume and reference it via .option("databricks.connection", …). That works across UC compute types, including serverless.
Why serverless can look slower in practice
When serverless appears slower, it’s usually due to one (or more) of these factors rather than a hard cap.
Parallelism on writes
Spark determines how many concurrent JDBC inserts to issue based on the number of partitions in the DataFrame at write time. If your serverless job ends up with fewer partitions than your classic cluster — because of different shuffle behavior, autoscaling characteristics, or simply not repartitioning explicitly — you’ll see fewer concurrent connections and lower throughput. The fix is simple and explicit: repartition to the right level for the target database.
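To make that concrete, here's a rough back-of-the-envelope sketch for picking the partition count you pass to .repartition() before the write. All the numbers here are hypothetical placeholders — substitute your database's actual connection limit and your real row counts:

```python
# Hypothetical numbers: adjust to your database and workload.
db_max_connections = 100              # connection ceiling on the target database
reserved_for_others = 60              # headroom for other clients/apps
rows = 2_000_000                      # rows in the DataFrame you are writing
target_rows_per_partition = 250_000   # rough per-partition workload target

# Don't exceed what the DB can absorb, and don't create tiny partitions.
max_safe = db_max_connections - reserved_for_others
by_volume = max(1, rows // target_rows_per_partition)
n_partitions = max(1, min(max_safe, by_volume))

print(n_partitions)  # then: df.repartition(n_partitions).write...
```

The point is simply that the number should come from an explicit calculation against database limits, not from whatever partition count the upstream job happens to produce.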
Driver sandbox and memory limits
On UC compute, JDBC drivers run in an isolated sandbox for security reasons. That sandbox has a 400 MiB memory cap. Very large rows or overly aggressive batch sizes can increase memory pressure. If you see memory-related failures or instability, dial back batch sizing or row width.
Network path
Serverless compute requires allow-listed egress or private connectivity to reach your database. Depending on how traffic is routed (NAT, PrivateLink, etc.), latency can differ from a classic cluster with direct peering. Databricks provides stable IP ranges for serverless and a simple connectivity test UDF to validate host, port, and latency. This is worth checking, especially if classic and serverless are using different paths.
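If you want a quick sanity check before reaching for the documented connectivity test, a plain TCP probe from the notebook works too. This is just a generic socket sketch (the host and port below are placeholders), not the built-in UDF:

```python
import socket
import time

def check_tcp(host, port, timeout=2.0):
    """Return (reachable, elapsed_seconds) for a plain TCP connect attempt."""
    start = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        ok = s.connect_ex((host, port)) == 0
    return ok, time.monotonic() - start

# Example: probe your database host/port from the serverless notebook
# ok, latency = check_tcp("mydb.example.com", 5432)
```

Comparing the latency you measure from serverless against the same probe from a classic cluster tells you quickly whether the network path is the variable.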
Not a blanket limitation
There’s no documented “serverless JDBC write cap.” The JDBC connection is supported on both serverless and standard compute, and performance is generally explained by the factors above, not by an inherent serverless throttle.
Recommended tuning for Spark JDBC writes
If you want predictable and repeatable performance, be explicit.
Control write parallelism
Pick a partition count that matches what your target database can safely handle. Spark uses one connection per partition for JDBC writes. Repartition before .write to control concurrency, and avoid overwhelming the database with too many simultaneous connections.
Tune the JDBC options that matter
Use a moderate batch size. In practice, something in the 2k–10k range is a good starting point. This reduces round-trips without causing large memory spikes. Tune based on your database’s behavior.
Depending on your semantics and driver support, options like isolation level or truncate can also reduce overhead.
Reduce per-row work on the destination
If you’re doing bulk loads, minimize secondary indexes, triggers, and constraints where possible, then add or re-enable them after the load. This is destination-specific, but it often makes a dramatic difference.
Validate connectivity and limits
Use the connectivity test to confirm latency and routing from the compute environment. Make sure serverless IPs are allow-listed or that private connectivity is configured correctly.
Example pattern using a UC JDBC connection
N = 16  # choose based on DB connection limits and expected concurrency

(df
    .repartition(N)  # controls concurrent JDBC connections
    .write
    .format("jdbc")
    .option("databricks.connection", "<your_jdbc_connection_name>")
    .option("dbtable", "<schema.table_or_destination>")
    .option("batchsize", "5000")  # tune for your DB
    .mode("append")
    .save())
At query time, users specify databricks.connection, and only options allow-listed on that connection are permitted. Spark controls write concurrency entirely through the number of in-memory partitions.
What about psycopg2 (or other native drivers)?
It’s a valid option, but it’s a different trade-off. A driver like psycopg2 typically uses a single client connection by default. To match Spark’s throughput, you’d need to implement your own parallelism and batching logic.
Spark’s JDBC path gives you governed access via a UC connection and built-in parallel writes across partitions. If your database supports a true bulk loader (for example, COPY), that can be extremely fast, but it usually comes with bespoke logic and less out-of-the-box governance.
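For illustration, the batching logic you'd have to hand-roll with a native driver can be as simple as a chunking helper like the one below. The commented psycopg2 usage is a sketch under assumed names (dsn, my_table, the column list are all hypothetical):

```python
import itertools

def chunks(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch

# With a real driver, a single-connection batched load might look like
# (hypothetical DSN and table):
#
# import psycopg2
# from psycopg2.extras import execute_values
# with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
#     for batch in chunks(rows, 5000):
#         execute_values(cur, "INSERT INTO my_table (a, b) VALUES %s", batch)
```

Even then you're still on one connection; to approach Spark's throughput you'd also need to shard the data and run several such loops in parallel, which is exactly the machinery Spark's partition-per-connection model gives you for free.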
When this approach is the right one
Databricks generally recommends the JDBC Unity Catalog connection when you need to write to an external database from Spark, when federation isn’t an option or is read-only, or when you need a specific JDBC driver and fine-grained control over Spark write behavior.
Hope this gives you guidance.
Cheers, Louis