Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Pushing data from databricks (cloud) to Oracle (on-prem) instance?

RIDBX
New Contributor III



Thanks for reviewing my thread. I found some threads on this subject from 2022 (for example, "Databricks to Oracle").

Many new options have been introduced in Databricks since then, so I am opening this thread to get some new insights.

We have a situation where Databricks tables (large volumes) are sitting in the cloud, and we would like to push/ingest these data sets into an on-prem Oracle instance. I saw some posts suggesting the use of JDBC in a Databricks notebook with a Spark write to the Oracle table.

Is this the best option for large-volume data?

I also learned about a partnership between Oracle and Databricks. It gives an option to connect to Databricks from an on-prem Oracle instance and read Databricks data from Oracle. I did not find much information on this.

I see that the recent Databricks roadmap adds a lot more functionality. Is there functionality in Lakebridge or Lakeflow that would provide a solution?

Are there any docs/whitepapers on this subject?


Thanks for your insights.

1 REPLY

Louis_Frolio
Databricks Employee

Hello @RIDBX, right now there isn’t a dedicated Databricks connector or library for writing directly into on-prem Oracle. We don’t see this pattern very often, so there’s no built-in proprietary option today. That said, I dug through our docs and pulled together the options that do exist so you at least have a few concrete paths to consider.

What’s available today

  • Spark JDBC (Databricks → Oracle on‑prem): You can write from Databricks to Oracle using the generic Spark JDBC sink. For large volumes, it’s important to control write parallelism (for example, df.repartition(n) before .write.format("jdbc")) so you don’t overwhelm the Oracle instance and network path, and to size partitions to what Oracle can ingest efficiently.


    Example (Python):
    # Assumes `df` is the DataFrame to push and the Oracle JDBC driver (ojdbc JAR)
    # is installed on the cluster.
    (df.repartition(8)  # tune to match Oracle capacity and network
       .write
       .format("jdbc")
       .option("url", "jdbc:oracle:thin:@//host:1521/service_name")
       .option("driver", "oracle.jdbc.OracleDriver")
       .option("dbtable", "SCHEMA.TARGET_TABLE")
       .option("user", "<user>")
       .option("password", "<password>")
       .mode("append")
       .save())
  • Lakehouse Federation (Oracle → Databricks, read‑only): Databricks supports a managed way to run federated queries against Oracle without moving data into the lakehouse. You create a Unity Catalog connection + foreign catalog and then query Oracle tables from Databricks (see the federation sketch after this list). Note: federation is read‑only (no writes back to Oracle from federation).

  • Oracle reading Databricks tables (zero‑copy): If what you want is to access Databricks tables from Oracle, Oracle is a supported consumer of Delta Sharing for read‑only access to Unity Catalog tables (zero‑copy). This lets Oracle read Databricks data without you pushing rows over JDBC in bulk.

  • Networking considerations (on‑prem Oracle): Regardless of the approach, Databricks compute must be able to reach the on‑prem listener (VPN/Direct Connect/ExpressRoute or equivalent). This is an explicit prerequisite for both JDBC and Lakehouse Federation connections; a quick reachability check is also sketched below.
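
    For reference, here is a minimal Lakehouse Federation sketch run from a notebook. The connection, catalog, and table names are placeholders, and the exact Oracle OPTIONS keys (host, port, user, password, service_name) are assumptions to verify against the Lakehouse Federation docs for your workspace:

    Example (Python):
    # Placeholders throughout; confirm the Oracle-specific OPTIONS keys in the federation docs.
    spark.sql("""
        CREATE CONNECTION IF NOT EXISTS oracle_onprem TYPE oracle
        OPTIONS (host '<host>', port '1521', user '<user>', password '<password>')
    """)
    spark.sql("""
        CREATE FOREIGN CATALOG IF NOT EXISTS oracle_cat
        USING CONNECTION oracle_onprem
        OPTIONS (service_name '<service_name>')
    """)
    # Read-only access: Oracle tables appear as three-level names in Databricks.
    display(spark.sql("SELECT * FROM oracle_cat.some_schema.some_table LIMIT 10"))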
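
    And a quick way to confirm the network path from the cluster before debugging JDBC itself; the hostname and port below are placeholders:

    Example (Python):
    import socket

    # Raises an exception if the Oracle listener is not reachable from the Databricks cluster.
    with socket.create_connection(("oracle-host.example.com", 1521), timeout=10):
        print("TCP path to the Oracle listener looks open from this cluster")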

Is JDBC the “best” option for large volumes?

It’s a solid and common option, but “best” depends on constraints:

  • If you truly need to copy large volumes into Oracle and keep doing it regularly, JDBC works, but you must throttle/consolidate partitions and batch the writes so the Oracle server and network can keep up; otherwise, throughput will collapse or you’ll see back‑pressure/timeouts.

  • If your requirement is only for Oracle to consume the data (not to “own” it), strongly consider Delta Sharing to Oracle (read‑only, zero‑copy) to avoid expensive copy jobs and eliminate write bottlenecks on Oracle; a setup sketch follows this list.

  • For ad hoc or operational reporting where you don’t want to ingest into Databricks first, Lakehouse Federation is ideal (Databricks reads from Oracle on demand), but again this is read‑only and not a solution for pushing data into Oracle.
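
    To make the Delta Sharing route concrete, here is a minimal sketch of the Databricks‑side setup. The share, recipient, and table names are placeholders, and the Oracle side then consumes the share using the credential/activation details you provide to the recipient:

    Example (Python):
    # Placeholders throughout; requires appropriate Unity Catalog / Delta Sharing privileges.
    spark.sql("CREATE SHARE IF NOT EXISTS oracle_consumers")
    spark.sql("ALTER SHARE oracle_consumers ADD TABLE main.sales.orders")
    spark.sql("CREATE RECIPIENT IF NOT EXISTS oracle_team")
    spark.sql("GRANT SELECT ON SHARE oracle_consumers TO RECIPIENT oracle_team")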

About “Lakebridge” vs. Lakeflow

  • Lakebridge is Databricks’ migration tooling for moving legacy warehouse workloads onto Databricks, not a mechanism for writing data out to Oracle. For this use case the relevant pieces are Lakehouse Federation (read‑only query over JDBC) and Lakeflow (Databricks’ unified ingestion/orchestration offering).

  • Lakeflow Connect provides managed ingestion connectors (GA for several sources like SQL Server, Salesforce, etc.). The Lakeflow roadmap explicitly includes Oracle database connectors, with Oracle highlighted in the Lakeflow launch/blog as an upcoming source; timing/features evolve over time.


    Today, if you need Databricks → Oracle loads, Lakeflow doesn’t provide a managed “reverse ETL to Oracle” sink; you’d use Spark JDBC as shown above.

Practical guidance (large volume to Oracle)

  • Start with a Spark JDBC write and tune it (see the tuning sketch after this list):

    • Repartition to a small, fixed number that Oracle can handle (e.g., 4–16), and size clusters accordingly.
    • If you currently have very fine‑grained partitions, coalesce before writing to reduce concurrent connections/inserts.
  • Validate network throughput and stability first (on‑prem paths can be the bottleneck even when Oracle is sized correctly).

  • If the goal is for Oracle‑side consumers to see the data, evaluate Delta Sharing so Oracle can read your Unity Catalog tables directly without copying, which often outperforms bulk copy jobs operationally and cost‑wise (read‑only).
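
    A minimal tuning sketch for that JDBC write is below. The partition count and batch size are illustrative, not recommendations, and the connection details are placeholders as in the earlier example:

    Example (Python):
    # Illustrative values; tune to what the Oracle instance and network path can sustain.
    (df.coalesce(8)                        # cap concurrent connections/inserts into Oracle
       .write
       .format("jdbc")
       .option("url", "jdbc:oracle:thin:@//host:1521/service_name")
       .option("driver", "oracle.jdbc.OracleDriver")
       .option("dbtable", "SCHEMA.TARGET_TABLE")
       .option("user", "<user>")
       .option("password", "<password>")
       .option("batchsize", "10000")       # rows per JDBC batch round trip (Spark default is 1000)
       .mode("append")
       .save())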

Good docs to bookmark

Hope this helps, Louis.