<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic DLT Medallion Incremental Ingestion Pattern Approach in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/dlt-medallion-incremental-ingestion-pattern-approach/m-p/56876#M6367</link>
    <description>&lt;P&gt;Hi there, what is the "recommended" incremental ingestion approach for using DLT to pull raw landing data into bronze and then silver?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The approach I've been considering is to have raw CSV files arrive in a &lt;SPAN&gt;landing&lt;/SPAN&gt;&amp;nbsp;DBFS path and ingest them into a &lt;SPAN&gt;bronze&lt;/SPAN&gt;&amp;nbsp;`streaming` table (even though the pipeline is only triggered 1-2 times a day). This bronze table would hold ALL the raw data ever submitted, duplicates included. Immediately downstream, a &lt;SPAN&gt;silver&lt;/SPAN&gt; `streaming` table would deduplicate the data and enforce the correct data types. Below is the code for a single DLT bronze `streaming` table as I intend to define it:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;import dlt&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;@dlt.table&lt;/P&gt;&lt;P&gt;def bronze_table_name():&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# Auto Loader (cloudFiles) only picks up files that are new since the last run&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;return (&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;spark.readStream.format("cloudFiles")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.option("header", "true")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.option("cloudFiles.format", "csv")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.option("inferSchema", "true")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.option("cloudFiles.partitionColumns", "project_id")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.load(f"{dataset_path}/{table_name}")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.select("*", "_metadata.file_name")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Alternatively, I've noticed a slightly different pattern that defines &lt;SPAN&gt;bronze&lt;/SPAN&gt; as a view rather than a table, with both deduplication and data type enforcement handled in the &lt;SPAN&gt;silver&lt;/SPAN&gt; table.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I would appreciate feedback on this.
In our case we have some very big tables, so I'm not sure if or when the second approach would make sense for us. I'm assuming that a bronze view would take more and more "querying" time as the raw datasets keep growing, whereas my original approach would only ever process the new raw data on each daily run rather than querying the entire dataset.&lt;/P&gt;</description>
    <pubDate>Wed, 10 Jan 2024 17:07:05 GMT</pubDate>
    <dc:creator>ChristianRRL</dc:creator>
    <dc:date>2024-01-10T17:07:05Z</dc:date>
    <item>
      <title>DLT Medallion Incremental Ingestion Pattern Approach</title>
      <link>https://community.databricks.com/t5/get-started-discussions/dlt-medallion-incremental-ingestion-pattern-approach/m-p/56876#M6367</link>
      <description>&lt;P&gt;Hi there, what is the "recommended" incremental ingestion approach for using DLT to pull raw landing data into bronze and then silver?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The approach I've been considering is to have raw CSV files arrive in a &lt;SPAN&gt;landing&lt;/SPAN&gt;&amp;nbsp;DBFS path and ingest them into a &lt;SPAN&gt;bronze&lt;/SPAN&gt;&amp;nbsp;`streaming` table (even though the pipeline is only triggered 1-2 times a day). This bronze table would hold ALL the raw data ever submitted, duplicates included. Immediately downstream, a &lt;SPAN&gt;silver&lt;/SPAN&gt; `streaming` table would deduplicate the data and enforce the correct data types. Below is the code for a single DLT bronze `streaming` table as I intend to define it:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;import dlt&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;@dlt.table&lt;/P&gt;&lt;P&gt;def bronze_table_name():&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# Auto Loader (cloudFiles) only picks up files that are new since the last run&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;return (&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;spark.readStream.format("cloudFiles")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.option("header", "true")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.option("cloudFiles.format", "csv")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.option("inferSchema", "true")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.option("cloudFiles.partitionColumns", "project_id")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.load(f"{dataset_path}/{table_name}")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.select("*", "_metadata.file_name")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Alternatively, I've noticed a slightly different pattern that defines &lt;SPAN&gt;bronze&lt;/SPAN&gt; as a view rather than a table, with both deduplication and data type enforcement handled in the &lt;SPAN&gt;silver&lt;/SPAN&gt; table.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I would appreciate feedback on this.
In our case we have some very big tables, so I'm not sure if or when the second approach would make sense for us. I'm assuming that a bronze view would take more and more "querying" time as the raw datasets keep growing, whereas my original approach would only ever process the new raw data on each daily run rather than querying the entire dataset.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jan 2024 17:07:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/dlt-medallion-incremental-ingestion-pattern-approach/m-p/56876#M6367</guid>
      <dc:creator>ChristianRRL</dc:creator>
      <dc:date>2024-01-10T17:07:05Z</dc:date>
    </item>
    <item>
      <title>Re: DLT Medallion Incremental Ingestion Pattern Approach</title>
      <link>https://community.databricks.com/t5/get-started-discussions/dlt-medallion-incremental-ingestion-pattern-approach/m-p/57505#M6369</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;, thank you for the feedback. This is very helpful, and I really like the CDC example. I would like to double-check, though: did you mean to send the same link for both examples? I think the CDC example link applies more to your first paragraph than to the second (correct me if I'm wrong). If you meant to provide a different example link for the second paragraph, can you please edit your last comment or add a new one below? Also, I can't tell if the last two links are meant to point to other pages as well.&lt;/P&gt;&lt;P&gt;Thanks again!&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2024 19:08:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/dlt-medallion-incremental-ingestion-pattern-approach/m-p/57505#M6369</guid>
      <dc:creator>ChristianRRL</dc:creator>
      <dc:date>2024-01-16T19:08:06Z</dc:date>
    </item>
  </channel>
</rss>

