<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: DLT Dedupping Best Practice in Medallion in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/dlt-dedupping-best-practice-in-medallion/m-p/100862#M40449</link>
    <description>&lt;P&gt;A typical recommendation is to avoid doing any transformations as data lands in the bronze layer (ELT). The idea is that you want your bronze layer to be as close a representation of your source data as possible: if a mistake surfaces later, or, taking your example, if the presence of duplicates is itself a useful signal, it helps to have an accurate system of record.&lt;/P&gt;
&lt;P&gt;So in your example, raw data lands in bronze as-is and is deduplicated in the silver layer. These are not hard and fast rules - they are up to your practice.&lt;/P&gt;</description>
    <pubDate>Wed, 04 Dec 2024 08:03:28 GMT</pubDate>
    <dc:creator>cgrant</dc:creator>
    <dc:date>2024-12-04T08:03:28Z</dc:date>
    <item>
      <title>DLT Dedupping Best Practice in Medallion</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-dedupping-best-practice-in-medallion/m-p/56894#M30684</link>
      <description>&lt;P&gt;Hi there, I have what may be a deceptively simple question, but I suspect it may have a variety of answers:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;What is the "right" place to handle deduping in the medallion architecture?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;In my example, I already have everything properly laid out with data arriving in a `landing` location, and I even have a DLT job that can loop through all respective source CSV &amp;gt; target DELTA tables. At the moment, the data comes in entirely as raw CSVs into a &lt;U&gt;&lt;STRONG&gt;bronze&lt;/STRONG&gt;&lt;/U&gt; delta table (DLT Streaming), and no deduping is done here whatsoever. If the same data is sent via two differently timestamped CSVs, *all* of it will show up in &lt;STRONG&gt;bronze&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;My current intent is to have all the raw data arrive in &lt;STRONG&gt;bronze&lt;/STRONG&gt;, and then dedup it into a second &lt;U&gt;&lt;STRONG&gt;silver&lt;/STRONG&gt;&lt;/U&gt; delta table (DLT Streaming).&lt;/P&gt;&lt;P&gt;Does this make sense? I'm curious whether others handle this the same way, or whether it is more common practice to handle deduping in the &lt;STRONG&gt;bronze&lt;/STRONG&gt; table instead?&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jan 2024 20:27:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-dedupping-best-practice-in-medallion/m-p/56894#M30684</guid>
      <dc:creator>ChristianRRL</dc:creator>
      <dc:date>2024-01-10T20:27:45Z</dc:date>
    </item>
    <item>
      <title>Re: DLT Dedupping Best Practice in Medallion</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-dedupping-best-practice-in-medallion/m-p/100862#M40449</link>
      <description>&lt;P&gt;A typical recommendation is to avoid doing any transformations as data lands in the bronze layer (ELT). The idea is that you want your bronze layer to be as close a representation of your source data as possible: if a mistake surfaces later, or, taking your example, if the presence of duplicates is itself a useful signal, it helps to have an accurate system of record.&lt;/P&gt;
&lt;P&gt;So in your example, raw data lands in bronze as-is and is deduplicated in the silver layer. These are not hard and fast rules - they are up to your practice.&lt;/P&gt;</description>
      <pubDate>Wed, 04 Dec 2024 08:03:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-dedupping-best-practice-in-medallion/m-p/100862#M40449</guid>
      <dc:creator>cgrant</dc:creator>
      <dc:date>2024-12-04T08:03:28Z</dc:date>
    </item>
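The pattern described in the reply above - bronze as an untouched system of record, silver deduplicated by business key - can be sketched in plain Python (no Spark or DLT involved; the table names, the `order_id` key, and the `ingested_at` field are illustrative, not from the thread):

```python
# Illustrative sketch only: bronze keeps every row exactly as it arrived
# (duplicates included), and silver is derived by deduplicating on a key.
from datetime import datetime

bronze = []  # append-only: an accurate system of record

def land_csv_batch(rows):
    """Ingest a batch into bronze with no transformation (ELT)."""
    bronze.extend(rows)

def build_silver():
    """Deduplicate bronze by business key, keeping the latest row per key."""
    latest = {}
    for row in bronze:
        key = row["order_id"]
        if key not in latest or row["ingested_at"] > latest[key]["ingested_at"]:
            latest[key] = row
    return list(latest.values())

# The same record arrives in two differently timestamped CSV batches:
land_csv_batch([{"order_id": 1, "amount": 10, "ingested_at": datetime(2024, 1, 1)}])
land_csv_batch([{"order_id": 1, "amount": 10, "ingested_at": datetime(2024, 1, 2)},
                {"order_id": 2, "amount": 99, "ingested_at": datetime(2024, 1, 2)}])

print(len(bronze))          # 3 -- bronze shows every arrival, duplicates and all
print(len(build_silver()))  # 2 -- silver holds one row per business key
```

In an actual DLT pipeline the silver step would be a streaming table reading from bronze; the sketch only models the layering, not the streaming semantics.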
    <item>
      <title>Re: DLT Dedupping Best Practice in Medallion</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-dedupping-best-practice-in-medallion/m-p/101423#M40657</link>
      <description>&lt;P&gt;1. In the medallion architecture, deduplication can be handled in either the bronze or the silver layer.&lt;BR /&gt;2. If you keep a complete history of all raw data, duplicates included, in the bronze layer, handle deduplication in the silver layer.&lt;BR /&gt;3. If you do not keep that complete history, handle deduplication in the bronze layer to reduce the data processed in the silver layer.&lt;BR /&gt;4. Weigh data volume, performance, cost, and data quality when deciding where to handle deduplication.&lt;BR /&gt;5. In your use case, handling deduplication in the silver layer is valid, but consider moving it to the bronze layer if the silver layer ends up processing a large volume of duplicate data.&lt;/P&gt;</description>
      <pubDate>Mon, 09 Dec 2024 07:46:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-dedupping-best-practice-in-medallion/m-p/101423#M40657</guid>
      <dc:creator>Sidhant07</dc:creator>
      <dc:date>2024-12-09T07:46:12Z</dc:date>
    </item>
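Point 3 in the reply above - dropping duplicates at bronze ingest when a full raw history is not required, so silver has less to process - amounts to filtering at write time. A minimal plain-Python sketch (the `order_id` key and row shape are illustrative, not from the thread):

```python
# Illustrative sketch only: duplicates are discarded before they reach
# bronze, trading the complete raw history for a smaller silver workload.
seen_keys = set()
bronze = []

def land_csv_batch(rows):
    """Ingest a batch, skipping rows whose business key was already seen."""
    for row in rows:
        key = row["order_id"]
        if key in seen_keys:
            continue  # duplicate: discarded before it reaches bronze
        seen_keys.add(key)
        bronze.append(row)

land_csv_batch([{"order_id": 1, "amount": 10}])
land_csv_batch([{"order_id": 1, "amount": 10},   # duplicate, dropped
                {"order_id": 2, "amount": 99}])

print(len(bronze))  # 2 -- silver now reads a pre-deduplicated bronze
```

The trade-off is exactly the one the reply lists: you lose the duplicate-arrival evidence in bronze in exchange for lower volume and cost downstream.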
  </channel>
</rss>

