Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DLT Dedupping Best Practice in Medallion

ChristianRRL
Contributor III

Hi there, I have a question that may be deceptively simple, but I suspect it has a variety of answers:

  • What is the "right" place to handle deduplication in the medallion architecture?

In my example, I already have everything laid out: data arrives in a `landing` location, and I have a DLT job that loops through each source CSV and loads it into its target Delta table. At the moment, the raw CSVs are loaded as-is into a bronze Delta table (DLT Streaming), with no deduplication at all. If the same data is sent in two differently timestamped CSVs, *all* of it shows up in bronze.

My current intent is to have all the raw data arrive in bronze, and then deduplicate it in a second, silver Delta table (DLT Streaming).

Does this make sense? I'm curious whether others handle this the same way, or whether it is more common practice to deduplicate in the bronze table instead.

Accepted Solutions

Kaniz_Fatma
Community Manager

Hi @ChristianRRL

Your approach to handling deduplication in the Silver layer of the Medallion architecture is quite common and aligns with the general principles of this architecture.

In the Medallion architecture, data flows through different layers, each with a specific purpose:

  1. Bronze Layer (Raw Data): This is where all the data from external source systems lands. The focus in this layer is quick Change Data Capture and the ability to provide an historical archiv.... So, it’s common to have duplicate data in this layer if the same data is sent via two differently timestamped CSVs.

  2. Silver Layer (Cleansed and Conformed Data): In this layer, the data from the Bronze layer is matched.... This is typically where deduplication would occur, as you’ve planned. The Silver layer provides an “Enterprise view” of all its key business entities, concepts, and trans....

So, your approach to deduplication in the Silver layer aligns with the typical use of the Medallion architecture.
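
To make the split concrete, here is a minimal, library-free Python sketch of the idea: bronze appends every row unchanged, and silver keeps one row per key, preferring the most recently delivered copy. The names `Row`, `record_id`, `file_ts`, `bronze_append`, and `silver_dedup` are hypothetical and only illustrate the logic; they are not part of DLT. In an actual pipeline the silver step would more likely be expressed with something like Spark's `dropDuplicates` or DLT's `apply_changes`, depending on how your keys and ordering column are defined, which the thread does not specify.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Row:
    record_id: str  # hypothetical business key used to detect duplicates
    value: float
    file_ts: str    # hypothetical ISO timestamp of the CSV the row arrived in


def bronze_append(bronze: list[Row], batch: Iterable[Row]) -> list[Row]:
    """Bronze: append every incoming row unchanged, duplicates included."""
    return bronze + list(batch)


def silver_dedup(bronze: Iterable[Row]) -> list[Row]:
    """Silver: keep one row per record_id, preferring the latest file_ts."""
    latest: dict[str, Row] = {}
    for row in bronze:
        current = latest.get(row.record_id)
        # ISO-8601 strings in the same format compare correctly as text
        if current is None or row.file_ts > current.file_ts:
            latest[row.record_id] = row
    return sorted(latest.values(), key=lambda r: r.record_id)


# Two differently timestamped CSVs that overlap on record "a"
csv_1 = [Row("a", 1.0, "2024-01-01T00:00:00"), Row("b", 2.0, "2024-01-01T00:00:00")]
csv_2 = [Row("a", 1.0, "2024-01-02T00:00:00"), Row("c", 3.0, "2024-01-02T00:00:00")]

bronze = bronze_append(bronze_append([], csv_1), csv_2)  # all 4 rows are kept
silver = silver_dedup(bronze)                            # one row per record_id
```

Keeping bronze append-only means the full delivery history stays available, so the silver rule (which key, which copy wins) can be changed and replayed later without re-ingesting the CSVs.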

