
Deduplication, Bronze (raw) or Silver (enriched)

baatchus
New Contributor III

Need some help choosing where to do deduplication of the data. I have sensor data in blob storage that I'm picking up with Databricks Autoloader. The data and the files can contain duplicates.

Which of the two options should I choose?

Option 1:

  • Create a Bronze (Raw) Delta Lake table that reads from the files with Autoloader and only appends data.
  • Create a Silver (Enriched) Delta Lake table that reads from the Bronze table and does a merge to deduplicate.
  • Create a second Silver (Enriched) Delta Lake table that reads from the first Silver table and joins with another table.

Option 2:

  • Create a Bronze (Raw) Delta Lake table that reads from the files with Autoloader and does a MERGE INTO to deduplicate.
  • Create a Silver (Enriched) Delta Lake table that reads from the Bronze table and joins with another table.
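
For reference, a minimal sketch of what the Option 2 pattern might look like: Autoloader feeding a deduplicating MERGE into Bronze via foreachBatch. The table name, paths, and key columns (bronze_sensor, sensor_id, event_time) are placeholders, and spark is the notebook's SparkSession.

from delta.tables import DeltaTable
from pyspark.sql import DataFrame

def upsert_to_bronze(batch_df: DataFrame, batch_id: int) -> None:
    # Drop duplicates inside the micro-batch so the MERGE sees at most one source row per key.
    deduped = batch_df.dropDuplicates(["sensor_id", "event_time"])
    bronze = DeltaTable.forName(spark, "bronze_sensor")  # existing Bronze Delta table
    (bronze.alias("t")
        .merge(deduped.alias("s"),
               "t.sensor_id = s.sensor_id AND t.event_time = s.event_time")
        .whenNotMatchedInsertAll()   # only insert rows not already in Bronze
        .execute())

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")      # adjust to the actual file format
    .load("/mnt/raw/sensors/")                # placeholder blob storage path
    .writeStream
    .foreachBatch(upsert_to_bronze)
    .option("checkpointLocation", "/mnt/checkpoints/bronze_sensor")
    .trigger(availableNow=True)
    .start())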

3 REPLIES

-werners-
Esteemed Contributor III

I would use option 2, for the simple reason that the deduped table is in fact raw data but without duplicates.

Unless you need to have a delta lake table with the duplicates?

If not, you will only create more overhead: your original data in blob storage, your bronze table with dups and your silver table without dups.

So you have 3 copies of the same schema. Option 2 reduces this to 2.

But again: unless you have a reason to keep the dups in a delta table.

P.S. Be aware that the merge itself can fail because of duplicates:

A merge operation can fail if multiple rows of the source dataset match and the merge attempts to update the same rows of the target Delta table. According to the SQL semantics of merge, such an update operation is ambiguous as it is unclear which source row should be used to update the matched target row.
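
To sidestep that, a common approach is to collapse the source to one row per merge key before running the MERGE. A minimal sketch, assuming placeholder columns sensor_id and event_time as the key and ingest_time as the tie-breaker:

from pyspark.sql import functions as F, Window

# Keep only the newest row per key so the MERGE matches at most one source row per target row.
w = Window.partitionBy("sensor_id", "event_time").orderBy(F.col("ingest_time").desc())
source_deduped = (source_df                    # source_df: the DataFrame you are merging in
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn"))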

peter_mcnally
New Contributor III

I am curious what you decided to do, and what you would do now. I have some sensor data I am grabbing with an API based on a date field. The way the API is configured, there may be some overlap in time, and I want to dedup any such records.

Thanks!

Tharun-Kumar
Databricks Employee

@peter_mcnally 

You can use a watermark together with dropDuplicates to handle late-arriving records and send only one copy of each record to the bronze table. This ensures your bronze table always holds the latest, deduplicated information.

This feature is explained in detail here - https://www.databricks.com/blog/2022/08/22/feature-deep-dive-watermarking-apache-spark-structured-st...
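
A minimal sketch of that pattern (watermark plus dropDuplicates in Structured Streaming); the column names, path, table name, and the 30-minute threshold are placeholders:

# Deduplicate within the watermark window while streaming into Bronze.
deduped = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("/mnt/raw/sensors/")                 # placeholder path
    .withWatermark("event_time", "30 minutes") # how late records may arrive
    .dropDuplicates(["sensor_id", "event_time"]))  # dedup state is bounded by the watermark

(deduped.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/bronze_dedup")
    .trigger(availableNow=True)
    .toTable("bronze_sensor"))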
