<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Best practices for initial large-scale ingestion from on‑premises Oracle to Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155764#M54308</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/226887"&gt;@amirabedhiafi&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thank you for your answer, I found it very helpful. It actually gave me an idea.&lt;/P&gt;&lt;P&gt;What if I use a database backup for the initial load instead of performing the first backfill step? This way, I could ingest all the historical data at once, and then store the last insert timestamp in the bronze Delta table to use it as a starting point for the continuous ingestion.&lt;/P&gt;&lt;P&gt;Do you think this approach would be reliable, or could it introduce consistency issues compared to using a defined cutoff point?&lt;/P&gt;</description>
    <pubDate>Wed, 29 Apr 2026 08:02:34 GMT</pubDate>
    <dc:creator>faruko</dc:creator>
    <dc:date>2026-04-29T08:02:34Z</dc:date>
    <item>
      <title>Best practices for initial large-scale ingestion from on‑premises Oracle to Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155646#M54288</link>
      <description>&lt;DIV&gt;&lt;P&gt;&lt;STRONG&gt;Hello everyone,&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I am responsible for designing and implementing a Lakehouse architecture in an industrial company.&lt;BR /&gt;I am currently facing some challenges regarding the initial ingestion of data from our on‑premise Oracle database into Databricks.&lt;/P&gt;&lt;P&gt;The data comes from production systems and is actively used by several applications. My main concern is that the initial load is very large, and I’m worried about impacting database performance or even causing issues if we extract all the data at once.&lt;/P&gt;&lt;P&gt;For the ongoing ingestion, the data volume will be much smaller and continuous, so that part is not an issue.&lt;BR /&gt;However, I would really appreciate advice or best practices on how to safely handle the &lt;STRONG&gt;first large‑scale ingestion&lt;/STRONG&gt; (initial load) without overloading or disrupting the Oracle database.&lt;/P&gt;&lt;P&gt;What approaches, tools, or patterns would you recommend in this situation?&lt;/P&gt;&lt;P&gt;Thank you in advance for your help.&lt;/P&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 28 Apr 2026 08:29:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155646#M54288</guid>
      <dc:creator>faruko</dc:creator>
      <dc:date>2026-04-28T08:29:41Z</dc:date>
    </item>
    <item>
      <title>Re: Best practices for initial large-scale ingestion from on‑premises Oracle to Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155667#M54291</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/226546"&gt;@faruko&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;You can split the initial load using partitioned reads. We used that approach in one of our projects. So instead of doing something like this:&lt;/P&gt;&lt;LI-CODE lang="sql"&gt;SELECT * FROM large_table&lt;/LI-CODE&gt;&lt;P&gt;you can do this:&lt;/P&gt;&lt;LI-CODE lang="sql"&gt;SELECT *
FROM large_table
WHERE id BETWEEN 0 AND 1000000&lt;/LI-CODE&gt;&lt;P&gt;With that approach you can even stop and resume the loading process if you implement it correctly. Also, the best time to do the initial load from the database is at night, when there is a limited number of active users/queries.&lt;/P&gt;
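&lt;P&gt;In Spark the ranged reads map directly to the built-in partitioned JDBC options. A rough sketch, assuming a hypothetical Oracle JDBC URL, secret scope, table name and ID bounds (adjust all of them to your environment):&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Minimal sketch: partitioned JDBC read from Oracle into a bronze Delta table.
# URL, secret scope/keys, table name and bounds are placeholders, not real values.
jdbc_url = "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1"

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "PROD.LARGE_TABLE")
      .option("user", dbutils.secrets.get("oracle-scope", "oracle-user"))
      .option("password", dbutils.secrets.get("oracle-scope", "oracle-password"))
      .option("partitionColumn", "ID")     # numeric or date column to split on
      .option("lowerBound", "0")
      .option("upperBound", "100000000")
      .option("numPartitions", "16")       # 16 parallel range queries against Oracle
      .option("fetchsize", "10000")
      .load())

df.write.format("delta").mode("append").saveAsTable("bronze.large_table")&lt;/LI-CODE&gt;&lt;P&gt;Keep numPartitions modest so you don't open too many concurrent sessions against the production database.&lt;/P&gt;</description>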
      <pubDate>Tue, 28 Apr 2026 10:52:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155667#M54291</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2026-04-28T10:52:48Z</dc:date>
    </item>
    <item>
      <title>Re: Best practices for initial large-scale ingestion from on‑premises Oracle to Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155676#M54294</link>
      <description>&lt;P&gt;Thank you for your suggestion.&lt;/P&gt;&lt;P&gt;Unfortunately, we do not have a unique incremental ID. Our data is identified by multiple tag_ids, with one record per tag every minute, based on a timestamp.&lt;/P&gt;&lt;P&gt;We initially considered using spark.readStream to load the historical data month by month during low-usage periods (e.g. weekends), but we are not certain whether later switching that ingestion to continuous mode would be compatible with checkpointing and state tracking.&lt;/P&gt;</description>
      <pubDate>Tue, 28 Apr 2026 11:22:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155676#M54294</guid>
      <dc:creator>faruko</dc:creator>
      <dc:date>2026-04-28T11:22:10Z</dc:date>
    </item>
    <item>
      <title>Re: Best practices for initial large-scale ingestion from on‑premises Oracle to Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155733#M54301</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/226546"&gt;@faruko&lt;/a&gt;&amp;nbsp;!&lt;/P&gt;&lt;P&gt;My idea is to treat the initial load as a controlled batch backfill, then start the CDC pipeline afterwards from a clear cutoff point.&lt;/P&gt;&lt;P&gt;You define a fixed cutoff timestamp or Oracle SCN for the initial snapshot and load the history in small time windows, for example month by month, week by week or day by day depending on volume:&lt;/P&gt;&lt;PRE&gt;WHERE event_timestamp &amp;gt;= :start_ts
  AND event_timestamp &amp;lt;  :end_ts&lt;/PRE&gt;&lt;P&gt;Since you have many tag_ids, you can split each time window further into tag buckets, for example:&lt;/P&gt;&lt;PRE&gt;WHERE event_timestamp &amp;gt;= :start_ts
  AND event_timestamp &amp;lt;  :end_ts
  AND ORA_HASH(tag_id, 15) = :bucket&lt;/PRE&gt;&lt;P&gt;This gives you controlled parallelism without needing a unique numeric ID.&lt;/P&gt;&lt;P&gt;Then store the progress in a control table (for example table_name, start_ts, end_ts, bucket, status, row_count, load_time) so the load is restartable if one chunk fails.&lt;/P&gt;&lt;P&gt;Later you write into a bronze Delta table with an idempotent key such as (tag_id, event_timestamp) or (tag_id, event_timestamp, source_id).&lt;/P&gt;&lt;P&gt;Once you finish the historical backfill up to the cutoff timestamp or SCN, you can start the incremental ingestion from that same point.&lt;/P&gt;&lt;P&gt;I would not try to use the same streaming checkpoint for monthly historical loading and then later change it to continuous ingestion. I would keep the initial backfill and the ongoing ingestion as two separate pipelines.&lt;/P&gt;&lt;P&gt;The docs describe the same idea of doing an initial hydration first and then switching to triggered or continuous CDC processing afterwards:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.azure.cn/en-us/databricks/ldp/what-is-change-data-capture" target="_blank"&gt;https://docs.azure.cn/en-us/databricks/ldp/what-is-change-data-capture&lt;/A&gt;&lt;/P&gt;
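&lt;P&gt;A rough PySpark sketch of one backfill chunk, assuming hypothetical source and bronze table names, JDBC options and timestamp format (none of these are your actual objects):&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# One (time window, tag bucket) chunk: read it from Oracle over JDBC and MERGE it
# into bronze on the idempotent key (tag_id, event_timestamp). Placeholders only.
from delta.tables import DeltaTable

jdbc_url = "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1"  # placeholder; add user/password options as well

def load_chunk(start_ts, end_ts, bucket):
    query = f"""
        SELECT tag_id, event_timestamp, value
        FROM prod.sensor_readings
        WHERE event_timestamp &amp;gt;= TO_TIMESTAMP('{start_ts}', 'YYYY-MM-DD HH24:MI:SS')
          AND event_timestamp &amp;lt;  TO_TIMESTAMP('{end_ts}', 'YYYY-MM-DD HH24:MI:SS')
          AND ORA_HASH(tag_id, 15) = {bucket}
    """
    chunk = (spark.read.format("jdbc")
             .option("url", jdbc_url)
             .option("query", query)
             .option("fetchsize", "10000")
             .load())

    (DeltaTable.forName(spark, "bronze.sensor_readings")
        .alias("t")
        .merge(chunk.alias("s"),
               "t.tag_id = s.tag_id AND t.event_timestamp = s.event_timestamp")
        .whenNotMatchedInsertAll()
        .execute())
    # After a successful merge, mark (start_ts, end_ts, bucket) as done in the
    # control table so the backfill can be resumed if a later chunk fails.&lt;/LI-CODE&gt;&lt;P&gt;Because each chunk is merged on the key, re-running a failed window/bucket cannot create duplicates in bronze.&lt;/P&gt;</description>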
      <pubDate>Tue, 28 Apr 2026 18:33:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155733#M54301</guid>
      <dc:creator>amirabedhiafi</dc:creator>
      <dc:date>2026-04-28T18:33:39Z</dc:date>
    </item>
    <item>
      <title>Re: Best practices for initial large-scale ingestion from on‑premises Oracle to Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155764#M54308</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/226887"&gt;@amirabedhiafi&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thank you for your answer, I found it very helpful. It actually gave me an idea.&lt;/P&gt;&lt;P&gt;What if I use a database backup for the initial load instead of performing the first backfill step? This way, I could ingest all the historical data at once, and then store the last insert timestamp in the bronze Delta table to use it as a starting point for the continuous ingestion.&lt;/P&gt;&lt;P&gt;Do you think this approach would be reliable, or could it introduce consistency issues compared to using a defined cutoff point?&lt;/P&gt;</description>
      <pubDate>Wed, 29 Apr 2026 08:02:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155764#M54308</guid>
      <dc:creator>faruko</dc:creator>
      <dc:date>2026-04-29T08:02:34Z</dc:date>
    </item>
    <item>
      <title>Re: Best practices for initial large-scale ingestion from on‑premises Oracle to Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155768#M54309</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/226546"&gt;@faruko&lt;/a&gt;&amp;nbsp;!&lt;/P&gt;&lt;P&gt;Yes, why not &lt;span class="lia-unicode-emoji" title=":grinning_face_with_smiling_eyes:"&gt;😄&lt;/span&gt;&amp;nbsp;but only if the backup or export has a clear, consistent cutoff point and the continuous ingestion starts from that exact point, ideally based on an Oracle SCN rather than just whatever happened to be in the backup. I would not rely only on the maximum insert timestamp found in the bronze table, because timestamps can miss late-arriving rows (the same goes for updates, deletes, clock differences or rows committed after the timestamp was generated).&lt;/P&gt;&lt;P&gt;For your case, where the natural key seems to be something like (tag_id, event_timestamp), I would use that as the merge key, or add another source-side technical key if duplicates are possible.&lt;/P&gt;&lt;P&gt;Oracle Data Pump exports are only guaranteed to be consistent across all exported tables at the same point in time when you use options like FLASHBACK_SCN or FLASHBACK_TIME (this is Oracle's own recommendation).&lt;/P&gt;&lt;P&gt;On the Databricks side you can keep the same logic, but keep in mind that the native Lakeflow Connect database connectors currently list MySQL, PostgreSQL and SQL Server but not Oracle, so for Oracle CDC you may need Oracle GoldenGate or a custom CDC pipeline, depending on what your company allows.&lt;/P&gt;
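&lt;P&gt;A small sketch of how you could pin that cutoff (jdbc_url, credentials and the control table name below are placeholders): capture the current SCN, pass it to the export as FLASHBACK_SCN, and record it as the starting point for the CDC pipeline:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Capture a consistent cutoff SCN and persist it. Placeholders only: jdbc_url,
# credentials and the ops.ingestion_cutoff table are illustrative, not real objects.
scn_df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("query", "SELECT current_scn FROM v$database")
          .load())
cutoff_scn = int(scn_df.collect()[0][0])

# Use this value as FLASHBACK_SCN for the Data Pump export so the dump is pinned
# to the same point in time, and configure the CDC pipeline to start at this SCN.
spark.createDataFrame([(cutoff_scn,)], "cutoff_scn long") \
     .write.mode("append").saveAsTable("ops.ingestion_cutoff")&lt;/LI-CODE&gt;</description>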
      <pubDate>Wed, 29 Apr 2026 09:17:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155768#M54309</guid>
      <dc:creator>amirabedhiafi</dc:creator>
      <dc:date>2026-04-29T09:17:56Z</dc:date>
    </item>
  </channel>
</rss>

