<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Optimal process for loading data where the full dataset is provided every day? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/optimal-process-for-loading-data-where-the-full-dataset-is/m-p/76473#M35223</link>
    <description>&lt;P&gt;In the case of full loads, you only need to pass&lt;/P&gt;&lt;PRE&gt;  .mode('overwrite')&lt;/PRE&gt;&lt;P&gt;when writing to your Bronze table. This is not related to Auto Loader.&lt;/P&gt;</description>
    <pubDate>Tue, 02 Jul 2024 08:41:03 GMT</pubDate>
    <dc:creator>Witold</dc:creator>
    <dc:date>2024-07-02T08:41:03Z</dc:date>
    <item>
      <title>Optimal process for loading data where the full dataset is provided every day?</title>
      <link>https://community.databricks.com/t5/data-engineering/optimal-process-for-loading-data-where-the-full-dataset-is/m-p/76466#M35222</link>
      <description>&lt;P&gt;We receive several datasets where the full dump is delivered daily or weekly. What is the best way to ingest this into Databricks using DLT or basic PySpark while adhering to the medallion architecture?&lt;/P&gt;&lt;P&gt;1. If we use Auto Loader into Bronze, we'd end up incrementing the Bronze table by 100,000 rows every day (with 99% duplicates).&lt;/P&gt;&lt;P&gt;How would we then move changes or additions downstream?&lt;/P&gt;</description>
      <pubDate>Tue, 02 Jul 2024 08:25:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimal-process-for-loading-data-where-the-full-dataset-is/m-p/76466#M35222</guid>
      <dc:creator>oakhill</dc:creator>
      <dc:date>2024-07-02T08:25:18Z</dc:date>
    </item>
    <item>
      <title>Re: Optimal process for loading data where the full dataset is provided every day?</title>
      <link>https://community.databricks.com/t5/data-engineering/optimal-process-for-loading-data-where-the-full-dataset-is/m-p/76473#M35223</link>
      <description>&lt;P&gt;In the case of full loads, you only need to pass&lt;/P&gt;&lt;PRE&gt;  .mode('overwrite')&lt;/PRE&gt;&lt;P&gt;when writing to your Bronze table. This is not related to Auto Loader.&lt;/P&gt;</description>
      <pubDate>Tue, 02 Jul 2024 08:41:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimal-process-for-loading-data-where-the-full-dataset-is/m-p/76473#M35223</guid>
      <dc:creator>Witold</dc:creator>
      <dc:date>2024-07-02T08:41:03Z</dc:date>
    </item>
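The append-vs-overwrite distinction in the reply above can be sketched in plain Python, with a list of dicts standing in for the Bronze table (the column names and rows are hypothetical; in PySpark the equivalent is `df.write.mode('append')` vs `df.write.mode('overwrite')`):

```python
# Sketch: why appending daily full dumps inflates Bronze, while
# overwriting does not. A list of dicts stands in for the Bronze table.

def load(bronze, daily_dump, mode):
    """Write the daily full dump into 'bronze' using the given mode."""
    if mode == "overwrite":
        return list(daily_dump)           # replace the whole table
    elif mode == "append":
        return bronze + list(daily_dump)  # accumulate duplicates
    raise ValueError(f"unknown mode: {mode}")

day1 = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
day2 = [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 3, "v": "c"}]

appended = load(load([], day1, "append"), day2, "append")
overwritten = load(load([], day1, "overwrite"), day2, "overwrite")

print(len(appended))     # 5 rows, mostly duplicates of day 1
print(len(overwritten))  # 3 rows, the latest full dump only
```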
    <item>
      <title>Re: Optimal process for loading data where the full dataset is provided every day?</title>
      <link>https://community.databricks.com/t5/data-engineering/optimal-process-for-loading-data-where-the-full-dataset-is/m-p/76511#M35236</link>
      <description>&lt;P&gt;Won't this cause trouble with CDC in the Silver layer, because the entire dataset is new? Or will it remember which rows from Bronze it has already read, even though the table is overwritten?&lt;/P&gt;</description>
      <pubDate>Tue, 02 Jul 2024 13:34:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimal-process-for-loading-data-where-the-full-dataset-is/m-p/76511#M35236</guid>
      <dc:creator>oakhill</dc:creator>
      <dc:date>2024-07-02T13:34:32Z</dc:date>
    </item>
    <item>
      <title>Re: Optimal process for loading data where the full dataset is provided every day?</title>
      <link>https://community.databricks.com/t5/data-engineering/optimal-process-for-loading-data-where-the-full-dataset-is/m-p/76514#M35238</link>
      <description>&lt;P&gt;It depends on your logic. If the source sends you a full load, this might mean you need to reprocess everything, including all downstream layers.&lt;/P&gt;&lt;P&gt;If the source only sends a full load because it is not capable of identifying changes, then you should apply CDC as early as possible. Usually a MERGE INTO with merge conditions and update conditions will help you.&lt;/P&gt;</description>
      <pubDate>Tue, 02 Jul 2024 13:44:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimal-process-for-loading-data-where-the-full-dataset-is/m-p/76514#M35238</guid>
      <dc:creator>Witold</dc:creator>
      <dc:date>2024-07-02T13:44:56Z</dc:date>
    </item>
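The MERGE INTO approach in the reply above can be sketched in plain Python, with a dict keyed on a hypothetical `id` column standing in for the target table (in Databricks this would be `MERGE INTO target USING dump ON target.id = dump.id` with WHEN MATCHED / WHEN NOT MATCHED clauses):

```python
# Sketch of MERGE INTO semantics for CDC on a full dump: match on a key,
# update only rows that actually changed, insert rows that are new.
# A dict keyed by 'id' stands in for the target Delta table.

def merge_full_dump(target, dump, key="id"):
    """Upsert a full daily dump into 'target'; returns (updated, inserted)."""
    updated = inserted = 0
    for row in dump:
        k = row[key]
        if k in target:
            if target[k] != row:   # update-condition: skip unchanged rows
                target[k] = row
                updated += 1
        else:                      # not matched: insert
            target[k] = row
            inserted += 1
    return updated, inserted

target = {1: {"id": 1, "v": "a"}, 2: {"id": 2, "v": "b"}}
dump = [{"id": 1, "v": "a"},   # unchanged -> no write
        {"id": 2, "v": "B"},   # changed   -> update
        {"id": 3, "v": "c"}]   # new       -> insert

updated, inserted = merge_full_dump(target, dump)
print(updated, inserted)  # 1 1
```

The update-condition (`target[k] != row`) is what keeps a full dump from rewriting every row downstream: only genuine changes propagate.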
    <item>
      <title>Re: Optimal process for loading data where the full dataset is provided every day?</title>
      <link>https://community.databricks.com/t5/data-engineering/optimal-process-for-loading-data-where-the-full-dataset-is/m-p/76614#M35281</link>
      <description>&lt;P&gt;Agree with&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/107959"&gt;@Witold&lt;/a&gt;&amp;nbsp;to apply CDC as early as possible. Depending on where the initial files get deposited, I'd recommend adding an initial raw layer to your medallion architecture, which is just your cloud storage account, so each day or week the files get deposited there. From there you can pull them into your Bronze layer using MERGE INTO, so that only the new or latest data is loaded.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Jul 2024 12:22:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimal-process-for-loading-data-where-the-full-dataset-is/m-p/76614#M35281</guid>
      <dc:creator>dbrx_user</dc:creator>
      <dc:date>2024-07-03T12:22:13Z</dc:date>
    </item>
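The raw landing layer suggested above can be sketched in plain Python: track which dump files have already been ingested so that each run pulls only the new drops into Bronze (file names are hypothetical; Auto Loader does this bookkeeping for real via its checkpoint):

```python
# Sketch: a raw landing zone (cloud storage) in front of Bronze. Track
# which dump files have already been ingested so only new drops are read.

processed = set()

def ingest_new_files(landing_files):
    """Return files not yet ingested, marking them as processed."""
    new = [f for f in sorted(landing_files) if f not in processed]
    processed.update(new)
    return new

day1 = ingest_new_files(["dump_2024-07-01.csv"])
day2 = ingest_new_files(["dump_2024-07-01.csv", "dump_2024-07-02.csv"])

print(day1)  # ['dump_2024-07-01.csv']
print(day2)  # ['dump_2024-07-02.csv']
```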
  </channel>
</rss>

