Saturday
Pushing data from databricks (cloud) to Oracle (on-prem) instance?
===================================================
Thanks for reviewing my thread. I found some threads on this subject dated in 2022 by @
Many new options have been introduced in Databricks since then, so I am opening this thread to get some fresh insights.
We have a situation where Databricks tables (large volumes) are sitting in the cloud, and we would like to push/ingest these data sets to an Oracle (on-prem) instance. I saw some posts suggesting the use of JDBC in a Databricks notebook with a Spark write to the Oracle table.
Is this the best option for large-volume data?
I also learned about a partnership between Oracle and Databricks that gives an option to connect to Databricks from an Oracle (on-prem) instance and read Databricks data from Oracle. I did not see much info on it.
I see that the recent Databricks roadmap adds a lot more functionality. Is there anything in Lakebridge or LakeFlow that could provide a solution?
Are there any docs/whitepapers on this subject?
Thanks for your insights.
Saturday
Hello @RIDBX, right now there isn’t a dedicated Databricks connector or library for writing directly into on-prem Oracle. We don’t see this pattern very often, so there’s no built-in proprietary option today. That said, I dug through our docs and pulled together the options that do exist so you at least have a few concrete paths to consider.
Spark JDBC (Databricks → Oracle on‑prem): You can write from Databricks to Oracle using the generic Spark JDBC sink. For large volumes, it’s important to control write parallelism (for example, df.repartition(n) before .write.format("jdbc")) so you don’t overwhelm the Oracle instance and network path, and to size partitions to what Oracle can ingest efficiently.
# Assumes the Oracle JDBC driver (e.g. ojdbc8) is installed on the cluster
# and that `df` is the Databricks table/DataFrame you want to push.
(df.repartition(8)  # tune to match Oracle capacity and network
   .write
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@//host:1521/service_name")
   .option("driver", "oracle.jdbc.OracleDriver")  # Oracle thin driver class
   .option("dbtable", "SCHEMA.TARGET_TABLE")
   .option("user", "<user>")
   .option("password", "<password>")
   .mode("append")
   .save())
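If you need more throughput, two standard Spark JDBC write options are worth benchmarking: batchsize (rows per JDBC batch insert, default 1000) and numPartitions (caps the number of concurrent JDBC connections used for the write). A minimal sketch; the values shown are illustrative starting points, not tuned recommendations:
(df.write
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@//host:1521/service_name")
   .option("driver", "oracle.jdbc.OracleDriver")
   .option("dbtable", "SCHEMA.TARGET_TABLE")
   .option("user", "<user>")
   .option("password", "<password>")
   .option("batchsize", 10000)       # rows sent per JDBC batch insert (Spark default: 1000)
   .option("numPartitions", 8)       # Spark coalesces to at most this many concurrent writers
   .mode("append")
   .save())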
Lakehouse Federation (Oracle → Databricks, read‑only): Databricks supports a managed way to run federated queries against Oracle without moving data into the lakehouse. You create a Unity Catalog connection + foreign catalog and then query Oracle tables from Databricks (a minimal query sketch follows after this list). Note: federation is read‑only (no writes back to Oracle from federation).
Oracle reading Databricks tables (zero‑copy): If what you want is to access Databricks tables from Oracle, Oracle is a supported consumer of Delta Sharing for read‑only access to Unity Catalog tables (zero‑copy). This lets Oracle read Databricks data without you pushing rows over JDBC in bulk.
Networking considerations (on‑prem Oracle): Regardless of the approach, Databricks compute must be able to reach the on‑prem listener (VPN/Direct Connect/ExpressRoute or equivalent). This is an explicit prerequisite for both JDBC and Lakehouse Federation connections.
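As a concrete illustration of the federation option above, here is a minimal read sketch once the Unity Catalog connection and foreign catalog exist; oracle_catalog, hr, and employees are hypothetical names, and the exact setup steps are in the Lakehouse Federation docs:
# Hypothetical foreign catalog created via Lakehouse Federation (read-only).
# Catalog/schema/table names are placeholders.
oracle_df = spark.read.table("oracle_catalog.hr.employees")

# Federated tables are queried like any other Unity Catalog table,
# but they cannot be written to from Databricks.
oracle_df.filter("DEPARTMENT_ID = 10").show()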
It’s a solid and common option, but “best” depends on constraints:
If you truly need to copy large volumes into Oracle and keep doing it regularly, JDBC works, but you must throttle/consolidate partitions and batch the writes so the Oracle server and network can keep up; otherwise throughput will collapse or you’ll see back‑pressure/timeouts.
If your requirement is only for Oracle to consume the data (not to “own” it), strongly consider Delta Sharing to Oracle (read‑only, zero‑copy) to avoid expensive copy jobs and eliminate write bottlenecks on Oracle; a consumer‑side sketch follows after this list.
For ad hoc or operational reporting where you don’t want to ingest into Databricks first, Lakehouse Federation is ideal (Databricks reads from Oracle on demand), but again this is read‑only and not a solution for pushing data into Oracle.
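For context on the Delta Sharing option, here is what a consumer-side read looks like with the open-source delta-sharing Python connector; the profile path and share/schema/table names are placeholders, and Oracle's own Delta Sharing integration has its own setup documented separately:
# Sketch using the open delta-sharing Python connector (pip install delta-sharing).
# The recipient profile file is issued by the data provider (the Databricks side).
import delta_sharing

profile = "/path/to/recipient_profile.share"            # placeholder path
table_url = profile + "#my_share.my_schema.my_table"    # <profile>#<share>.<schema>.<table>

# Read the shared Unity Catalog table without copying it out of Databricks.
pdf = delta_sharing.load_as_pandas(table_url)
print(pdf.head())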
There isn’t a Databricks product called “Lakebridge.” You might be thinking of Lakehouse Federation (read‑only query over JDBC) or Lakeflow (Databricks’ unified ingestion/orchestration offering).
Lakeflow Connect provides managed ingestion connectors (GA for several sources like SQL Server, Salesforce, etc.). The Lakeflow roadmap explicitly includes Oracle database connectors, with Oracle highlighted in the Lakeflow launch/blog as an upcoming source; timing/features evolve over time.
Start with Spark JDBC write and tune:
Validate network throughput and stability first (on‑prem paths can be the bottleneck even when Oracle is sized correctly).
If the goal is for Oracle‑side consumers to see the data, evaluate Delta Sharing so Oracle can read your Unity Catalog tables directly without copying, which often outperforms bulk copy jobs operationally and cost‑wise (read‑only).
Docs and references:
Run federated queries on Oracle (Lakehouse Federation setup, read‑only).
JDBC read/write patterns and performance tuning from Databricks.
Connect to external systems: approach comparison (federation vs. drivers vs. managed ingestion).
Access Databricks data from external systems (Delta Sharing; Oracle supported as a consumer).
Lakeflow overview and launch blog (managed connectors and roadmap mentioning Oracle).
Hope this helps, Louis.
11 hours ago
Thanks for weighing in and providing interesting insights. Here are some questions that came to mind while reviewing this thread.
1) .write.format("jdbc") has .option("dbtable", "SCHEMA.TARGET_TABLE"). Where do we specify the column mapping between the Databricks table and SCHEMA.TARGET_TABLE? Do the Databricks table and the Oracle SCHEMA.TARGET_TABLE structures need to mirror each other?
2) If we have many tables to process, do we need to repeat this for each table in a separate notebook cell, or is there a way to write one generic script and pass the table name/structure as a parameter? If we can, how do we do that?
3) .option("user", "<user>") and .option("password", "<password>") work for my own account. If we use a service principal account for the prod environment, how do we configure these credentials so they stay hidden?
4) I was referring to “Lakebridge.”
Thanks for your guidance.
10 hours ago
Use the built‑in Spark JDBC writer with Oracle’s JDBC driver. It’s the most direct path for writing into on‑prem Oracle and gives you control over batching, parallelism, and commit semantics.
df = spark.read.table("your_source_table")  # or a Delta/Parquet source

jdbc_url = "jdbc:oracle:thin:@//<host>:<port>/<service_name>"

props = {
    "user": "<oracle_user>",
    "password": "<oracle_password>",
    "driver": "oracle.jdbc.OracleDriver"
}

# Tune parallelism before write (example: 16 partitions). Don’t set too high.
(df.repartition(16)
   .write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "SCHEMA.TARGET_TABLE")
   .options(**props)
   .mode("append")
   .save())
Key tuning guidance for large volumes:
Control write parallelism by repartition(N) before the write; keep N moderate (e.g., 8–32) so you don’t overwhelm Oracle with too many concurrent inserts.
For reliability with constraints, write into a staging table and then MERGE on Oracle so partial failures don’t cause duplicates or primary key violations (a sketch of this pattern follows after this list); also leave Spark speculative execution disabled (its default), since speculative task retries can insert duplicate rows over JDBC.
Network matters: ensure private connectivity from Databricks compute to your on‑prem Oracle (VPN/ExpressRoute/Direct Connect) and allow the relevant ports; this same prerequisite is documented for Lakehouse Federation but applies equally to JDBC writes.
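A minimal sketch of the staging-table + MERGE pattern mentioned above. It assumes the python-oracledb package is installed on the cluster (an assumption, not a Databricks requirement), reuses jdbc_url/props from the snippet above, and uses placeholder table/column names:
import oracledb  # assumption: pip install oracledb on the cluster

# 1) Bulk-load into a staging table with the Spark JDBC writer.
#    mode("overwrite") drops/recreates the table by default; add .option("truncate", "true")
#    if you want to keep the staging table's existing definition.
(df.write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "SCHEMA.STG_TARGET_TABLE")   # placeholder staging table
   .options(**props)
   .mode("overwrite")
   .save())

# 2) Run the MERGE on Oracle so partial JDBC failures never leave duplicates in the target.
#    ID and COL1 are placeholder column names for the merge key and payload.
conn = oracledb.connect(user="<oracle_user>", password="<oracle_password>",
                        dsn="<host>:<port>/<service_name>")
with conn.cursor() as cur:
    cur.execute("""
        MERGE INTO SCHEMA.TARGET_TABLE t
        USING SCHEMA.STG_TARGET_TABLE s
           ON (t.ID = s.ID)
        WHEN MATCHED THEN UPDATE SET t.COL1 = s.COL1
        WHEN NOT MATCHED THEN INSERT (ID, COL1) VALUES (s.ID, s.COL1)
    """)
conn.commit()
conn.close()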
Notes:
Lakehouse Federation lets Databricks query Oracle without migrating data, by creating a Unity Catalog connection and a foreign catalog that mirrors Oracle. This is excellent for ad‑hoc analytics or POCs, but it is read‑only for Oracle—so it does not push data into Oracle tables. Use JDBC writes for ingestion into Oracle as in Option 1.
Oracle Federation requires network connectivity from Databricks compute and uses TLS for Oracle Cloud; other Oracle databases use Oracle’s Native Network Encryption (NNE).
Oracle Federation is supported; it was announced in public preview and has since gone GA across clouds. Use the docs to set up connections and foreign catalogs in Unity Catalog.
Lakeflow Connect is Databricks’ ingestion (into the Lakehouse) and workflow layer; where both Federation and Lakeflow Connect exist for a source, we recommend Lakeflow Connect for higher volumes and lower latency ingestion into Databricks—but Lakeflow is oriented to bringing data in, not writing out to Oracle as a sink.
8 hours ago
Thanks for weighing in with your guidance. I did not see your response/guidance for the following:
2) If we have many tables to process, do we need to repeat this for each table in a separate notebook cell, or is there a way to write one generic script and pass the table name/structure as a parameter? If we can, how do we do that?
3) .option("user", "<user>") and .option("password", "<password>") work for my own account. If we use a service principal account for the prod environment, how do we configure these credentials so they stay hidden?
Thanks for your guidance.