<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Best practices for initial large-scale ingestion from on‑premises Oracle to Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155676#M54294</link>
    <description>&lt;P&gt;Thank you for your suggestion.&lt;/P&gt;&lt;P&gt;Unfortunately, we do not have a unique incremental ID. Our data is identified by multiple tag_ids, with one record per tag every minute, based on a timestamp.&lt;/P&gt;&lt;P&gt;We initially considered using spark.readStream to load historical data month by month during low-usage periods (e.g. weekends), but we are not certain whether changing the ingestion frequency afterwards to continuous would be compatible with checkpointing and state tracking.&lt;/P&gt;</description>
    <pubDate>Tue, 28 Apr 2026 11:22:10 GMT</pubDate>
    <dc:creator>faruko</dc:creator>
    <dc:date>2026-04-28T11:22:10Z</dc:date>
    <item>
      <title>Best practices for initial large-scale ingestion from on‑premises Oracle to Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155646#M54288</link>
      <description>&lt;DIV&gt;&lt;P&gt;&lt;STRONG&gt;Hello everyone,&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I am responsible for designing and implementing a Lakehouse architecture in an industrial company.&lt;BR /&gt;I am currently facing some challenges regarding the initial ingestion of data from our on‑premises Oracle database into Databricks.&lt;/P&gt;&lt;P&gt;The data comes from production systems and is actively used by several applications. My main concern is that the initial load is very large, and I’m worried about impacting database performance or even causing issues if we extract all the data at once.&lt;/P&gt;&lt;P&gt;For the ongoing ingestion, the data volume will be much smaller and continuous, so that part is not an issue.&lt;BR /&gt;However, I would really appreciate advice or best practices on how to safely handle the &lt;STRONG&gt;first large‑scale ingestion&lt;/STRONG&gt; (initial load) without overloading or disrupting the Oracle database.&lt;/P&gt;&lt;P&gt;What approaches, tools, or patterns would you recommend in this situation?&lt;/P&gt;&lt;P&gt;Thank you in advance for your help.&lt;/P&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 28 Apr 2026 08:29:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155646#M54288</guid>
      <dc:creator>faruko</dc:creator>
      <dc:date>2026-04-28T08:29:41Z</dc:date>
    </item>
    <item>
      <title>Re: Best practices for initial large-scale ingestion from on‑premises Oracle to Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155667#M54291</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/226546"&gt;@faruko&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;You can split the initial load using partitioned reads; we used that approach in one of our projects. So instead of doing something like this:&lt;/P&gt;&lt;LI-CODE lang="sql"&gt;SELECT * FROM large_table&lt;/LI-CODE&gt;&lt;P&gt;you can do this:&lt;/P&gt;&lt;LI-CODE lang="sql"&gt;SELECT *
FROM large_table
WHERE id BETWEEN 0 AND 1000000&lt;/LI-CODE&gt;&lt;P&gt;With that approach you can even stop and resume the loading process if you implement it correctly. Also, the best time to run the initial load is at night, when there is only a limited number of active users/queries.&lt;/P&gt;
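&lt;P&gt;In Spark, that kind of range splitting maps onto the built-in JDBC partitioning options, so you do not have to hand-write each range query. A minimal sketch (URL, credentials, table and column names, bounds, and partition counts below are placeholders to tune for your environment):&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Sketch only: connection details, bounds and partition counts are placeholders.
jdbc_url = "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "PROD.LARGE_TABLE")
    .option("user", "etl_user")
    .option("password", dbutils.secrets.get("oracle", "etl_user_pwd"))
    .option("driver", "oracle.jdbc.OracleDriver")
    # Spark splits the read into numPartitions parallel range queries on ID
    .option("partitionColumn", "ID")
    .option("lowerBound", "0")
    .option("upperBound", "100000000")
    .option("numPartitions", "16")
    # A moderate fetch size keeps each round trip to Oracle reasonably small
    .option("fetchsize", "10000")
    .load()
)

df.write.format("delta").mode("overwrite").saveAsTable("bronze.large_table")&lt;/LI-CODE&gt;&lt;P&gt;Keeping numPartitions and fetchsize modest limits how many concurrent Oracle sessions the load opens, which helps protect the production workload.&lt;/P&gt;</description>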
      <pubDate>Tue, 28 Apr 2026 10:52:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155667#M54291</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2026-04-28T10:52:48Z</dc:date>
    </item>
    <item>
      <title>Re: Best practices for initial large-scale ingestion from on‑premises Oracle to Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155676#M54294</link>
      <description>&lt;P&gt;Thank you for your suggestion.&lt;/P&gt;&lt;P&gt;Unfortunately, we do not have a unique incremental ID. Our data is identified by multiple tag_ids, with one record per tag every minute, based on a timestamp.&lt;/P&gt;&lt;P&gt;We initially considered using spark.readStream to load historical data month by month during low-usage periods (e.g. weekends), but we are not certain whether changing the ingestion frequency afterwards to continuous would be compatible with checkpointing and state tracking.&lt;/P&gt;
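&lt;P&gt;For reference, the month-by-month backfill could also be written as plain batch reads with one timestamp predicate per month, rather than readStream; a rough sketch (table, column, dates, and connection details below are placeholders):&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Sketch only: table, timestamp column, dates and connection details are placeholders.
predicates = [
    "MEAS_TS &gt;= DATE '2024-01-01' AND MEAS_TS &lt; DATE '2024-02-01'",
    "MEAS_TS &gt;= DATE '2024-02-01' AND MEAS_TS &lt; DATE '2024-03-01'",
    # one entry per remaining month of history
]

df = spark.read.jdbc(
    url="jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1",
    table="PROD.TAG_READINGS",
    predicates=predicates,  # each predicate becomes one partition / one query
    properties={
        "user": "etl_user",
        "password": dbutils.secrets.get("oracle", "etl_user_pwd"),
        "driver": "oracle.jdbc.OracleDriver",
        "fetchsize": "10000",
    },
)

df.write.format("delta").mode("append").saveAsTable("bronze.tag_readings")&lt;/LI-CODE&gt;</description>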
      <pubDate>Tue, 28 Apr 2026 11:22:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-initial-large-scale-ingestion-from-on/m-p/155676#M54294</guid>
      <dc:creator>faruko</dc:creator>
      <dc:date>2026-04-28T11:22:10Z</dc:date>
    </item>
  </channel>
</rss>

