Saturday
Pushing data from databricks (cloud) to Oracle (on-prem) instance?
===================================================
Thanks for reviewing my thread. I found some threads on this subject dated in 2022 by @
Many new options have been introduced in Databricks since then, so I am opening this thread to get some fresh insights.
We have a situation where Databricks tables (large volumes) are sitting in the cloud, and we would like to push/ingest these data sets to an Oracle (on-prem) instance. I saw some posts suggesting the use of JDBC in a Databricks notebook with a Spark write to the Oracle table.
Is this the best option for large-volume data?
I also learned about a partnership between Oracle and Databricks that gives an option to connect to Databricks from an Oracle (on-prem) instance and read Databricks data from Oracle. I did not see much info on it.
I see that the recent Databricks roadmap adds a lot more functionality. Is there anything in Lakebridge or LakeFlow that could provide a solution?
Are there any docs/whitepapers on this subject?
Thanks for your insights.
Saturday
Hello @RIDBX, right now there isn’t a dedicated Databricks connector or library for writing directly into on-prem Oracle. We don’t see this pattern very often, so there’s no built-in proprietary option today. That said, I dug through our docs and pulled together the options that do exist so you at least have a few concrete paths to consider.
Spark JDBC (Databricks → Oracle on‑prem): You can write from Databricks to Oracle using the generic Spark JDBC sink. For large volumes, it’s important to control write parallelism (for example, df.repartition(n) before .write.format("jdbc")) so you don’t overwhelm the Oracle instance and network path, and to size partitions to what Oracle can ingest efficiently.
# Assumes the Oracle JDBC driver (e.g. ojdbc8) is installed on the cluster
# and that `df` is the Databricks table/DataFrame you want to push.
(df.repartition(8)  # tune to match Oracle capacity and network
   .write
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@//host:1521/service_name")
   .option("driver", "oracle.jdbc.OracleDriver")  # Oracle thin driver class
   .option("dbtable", "SCHEMA.TARGET_TABLE")
   .option("user", "<user>")
   .option("password", "<password>")
   .mode("append")
   .save())
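If you need more throughput, two standard Spark JDBC write options are worth benchmarking: batchsize (rows per JDBC batch insert, default 1000) and numPartitions (caps the number of concurrent JDBC connections used for the write). A minimal sketch; the values shown are illustrative starting points, not tuned recommendations:
(df.write
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@//host:1521/service_name")
   .option("driver", "oracle.jdbc.OracleDriver")
   .option("dbtable", "SCHEMA.TARGET_TABLE")
   .option("user", "<user>")
   .option("password", "<password>")
   .option("batchsize", 10000)       # rows sent per JDBC batch insert (Spark default: 1000)
   .option("numPartitions", 8)       # Spark coalesces to at most this many concurrent writers
   .mode("append")
   .save())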
Lakehouse Federation (Oracle → Databricks, read‑only): Databricks supports a managed way to run federated queries against Oracle without moving data into the lakehouse. You create a Unity Catalog connection + foreign catalog and then query Oracle tables from Databricks (a minimal query sketch follows after this list). Note: federation is read‑only (no writes back to Oracle from federation).
Oracle reading Databricks tables (zero‑copy): If what you want is to access Databricks tables from Oracle, Oracle is a supported consumer of Delta Sharing for read‑only access to Unity Catalog tables (zero‑copy). This lets Oracle read Databricks data without you pushing rows over JDBC in bulk.
Networking considerations (on‑prem Oracle): Regardless of the approach, Databricks compute must be able to reach the on‑prem listener (VPN/Direct Connect/ExpressRoute or equivalent). This is an explicit prerequisite for both JDBC and Lakehouse Federation connections.
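As a concrete illustration of the federation option above, here is a minimal read sketch once the Unity Catalog connection and foreign catalog exist; oracle_catalog, hr, and employees are hypothetical names, and the exact setup steps are in the Lakehouse Federation docs:
# Hypothetical foreign catalog created via Lakehouse Federation (read-only).
# Catalog/schema/table names are placeholders.
oracle_df = spark.read.table("oracle_catalog.hr.employees")

# Federated tables are queried like any other Unity Catalog table,
# but they cannot be written to from Databricks.
oracle_df.filter("DEPARTMENT_ID = 10").show()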
It’s a solid and common option, but “best” depends on constraints:
If you truly need to copy large volumes into Oracle and keep doing it regularly, JDBC works, but you must throttle/consolidate partitions and batch the writes so the Oracle server and network can keep up; otherwise throughput will collapse or you’ll see back‑pressure/timeouts.
If your requirement is only for Oracle to consume the data (not to “own” it), strongly consider Delta Sharing to Oracle (read‑only, zero‑copy) to avoid expensive copy jobs and eliminate write bottlenecks on Oracle; a consumer‑side sketch follows after this list.
For ad hoc or operational reporting where you don’t want to ingest into Databricks first, Lakehouse Federation is ideal (Databricks reads from Oracle on demand), but again this is read‑only and not a solution for pushing data into Oracle.
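For context on the Delta Sharing option, here is what a consumer-side read looks like with the open-source delta-sharing Python connector; the profile path and share/schema/table names are placeholders, and Oracle's own Delta Sharing integration has its own setup documented separately:
# Sketch using the open delta-sharing Python connector (pip install delta-sharing).
# The recipient profile file is issued by the data provider (the Databricks side).
import delta_sharing

profile = "/path/to/recipient_profile.share"            # placeholder path
table_url = profile + "#my_share.my_schema.my_table"    # <profile>#<share>.<schema>.<table>

# Read the shared Unity Catalog table without copying it out of Databricks.
pdf = delta_sharing.load_as_pandas(table_url)
print(pdf.head())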
There isn’t a Databricks product called “Lakebridge.” You might be thinking of Lakehouse Federation (read‑only query over JDBC) or Lakeflow (Databricks’ unified ingestion/orchestration offering).
Lakeflow Connect provides managed ingestion connectors (GA for several sources like SQL Server, Salesforce, etc.). The Lakeflow roadmap explicitly includes Oracle database connectors, with Oracle highlighted in the Lakeflow launch/blog as an upcoming source; timing/features evolve over time.
Start with Spark JDBC write and tune:
Validate network throughput and stability first (on‑prem paths can be the bottleneck even when Oracle is sized correctly).
If the goal is for Oracle‑side consumers to see the data, evaluate Delta Sharing so Oracle can read your Unity Catalog tables directly without copying, which often outperforms bulk copy jobs operationally and cost‑wise (read‑only).
Docs and references:
Run federated queries on Oracle (Lakehouse Federation setup, read‑only).
JDBC read/write patterns and performance tuning from Databricks.
Connect to external systems: approach comparison (federation vs. drivers vs. managed ingestion).
Access Databricks data from external systems (Delta Sharing; Oracle supported as a consumer).
Lakeflow overview and launch blog (managed connectors and roadmap mentioning Oracle).
Hope this helps, Louis.
11 hours ago
Thanks for weighing in and providing interesting insights. Here are some questions that came to mind while reviewing this thread.
1) .write.format("jdbc") has .option("dbtable", "SCHEMA.TARGET_TABLE"). Where do we specify the column mapping between the Databricks table and SCHEMA.TARGET_TABLE? Do the Databricks table and the Oracle SCHEMA.TARGET_TABLE structures need to mirror each other?
2) If we have many tables to process, do we need to repeat this for each table in a separate notebook cell, or is there a way to write one generic script and pass the table name/structure as a parameter? If we can, how do we do that?
3) .option("user", "<user>") and .option("password", "<password>") work for my own account. If we use a service principal account for the prod environment, how do we configure these credentials so they stay hidden?
4) I was referring to “Lakebridge.”
Thanks for your guidance.
10 hours ago
Use the built‑in Spark JDBC writer with Oracle’s JDBC driver. It’s the most direct path for writing into on‑prem Oracle and gives you control over batching, parallelism, and commit semantics.
df = spark.read.table("your_source_table")  # or a Delta/Parquet source

jdbc_url = "jdbc:oracle:thin:@//<host>:<port>/<service_name>"

props = {
    "user": "<oracle_user>",
    "password": "<oracle_password>",
    "driver": "oracle.jdbc.OracleDriver"
}

# Tune parallelism before write (example: 16 partitions). Don’t set too high.
(df.repartition(16)
   .write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "SCHEMA.TARGET_TABLE")
   .options(**props)
   .mode("append")
   .save())
Key tuning guidance for large volumes:
Control write parallelism by repartition(N) before the write; keep N moderate (e.g., 8–32) so you don’t overwhelm Oracle with too many concurrent inserts.
For reliability with constraints, write into a staging table and then MERGE on Oracle so partial failures don’t cause duplicates or primary key violations (a sketch of this pattern follows after this list); also leave Spark speculative execution disabled (its default), since speculative task retries can insert duplicate rows over JDBC.
Network matters: ensure private connectivity from Databricks compute to your on‑prem Oracle (VPN/ExpressRoute/Direct Connect) and allow the relevant ports; this same prerequisite is documented for Lakehouse Federation but applies equally to JDBC writes.
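A minimal sketch of the staging-table + MERGE pattern mentioned above. It assumes the python-oracledb package is installed on the cluster (an assumption, not a Databricks requirement), reuses jdbc_url/props from the snippet above, and uses placeholder table/column names:
import oracledb  # assumption: pip install oracledb on the cluster

# 1) Bulk-load into a staging table with the Spark JDBC writer.
#    mode("overwrite") drops/recreates the table by default; add .option("truncate", "true")
#    if you want to keep the staging table's existing definition.
(df.write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "SCHEMA.STG_TARGET_TABLE")   # placeholder staging table
   .options(**props)
   .mode("overwrite")
   .save())

# 2) Run the MERGE on Oracle so partial JDBC failures never leave duplicates in the target.
#    ID and COL1 are placeholder column names for the merge key and payload.
conn = oracledb.connect(user="<oracle_user>", password="<oracle_password>",
                        dsn="<host>:<port>/<service_name>")
with conn.cursor() as cur:
    cur.execute("""
        MERGE INTO SCHEMA.TARGET_TABLE t
        USING SCHEMA.STG_TARGET_TABLE s
           ON (t.ID = s.ID)
        WHEN MATCHED THEN UPDATE SET t.COL1 = s.COL1
        WHEN NOT MATCHED THEN INSERT (ID, COL1) VALUES (s.ID, s.COL1)
    """)
conn.commit()
conn.close()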
Notes:
Lakehouse Federation lets Databricks query Oracle without migrating data, by creating a Unity Catalog connection and a foreign catalog that mirrors Oracle. This is excellent for ad‑hoc analytics or POCs, but it is read‑only for Oracle—so it does not push data into Oracle tables. Use JDBC writes for ingestion into Oracle as in Option 1.
Oracle Federation requires network connectivity from Databricks compute and uses TLS for Oracle Cloud; other Oracle databases use Oracle’s Native Network Encryption (NNE).
Oracle Federation is supported; it was announced in public preview and has since gone GA across clouds. Use the docs to set up connections and foreign catalogs in Unity Catalog.
Lakeflow Connect is Databricks’ ingestion (into the Lakehouse) and workflow layer; where both Federation and Lakeflow Connect exist for a source, we recommend Lakeflow Connect for higher volumes and lower latency ingestion into Databricks—but Lakeflow is oriented to bringing data in, not writing out to Oracle as a sink.
8 hours ago
Thanks for weighing in with your guidance. I did not see your response/guidance for the following:
2) If we have many tables to process, do we need to repeat this for each table in a separate notebook cell, or is there a way to write one generic script and pass the table name/structure as a parameter? If we can, how do we do that?
3) .option("user", "<user>") and .option("password", "<password>") work for my own account. If we use a service principal account for the prod environment, how do we configure these credentials so they stay hidden?
Thanks for your guidance.