Deduplication, Bronze (raw) or Silver (enriched)

baatchus
New Contributor III

Need some help choosing where to deduplicate data. I have sensor data in blob storage that I'm picking up with Databricks Autoloader. The data and files can contain duplicates.

Which of these two options should I choose?

Option 1:

  • Create a Bronze (Raw) Delta Lake table that reads from the files with Autoloader and only appends data.
  • Create a Silver (Enriched) Delta Lake table that reads from the Bronze table and performs a merge to deduplicate.
  • Create a second Silver (Enriched) Delta Lake table that reads from the first Silver table and joins with another table.

Option 2:

  • Create a Bronze (Raw) Delta Lake table that reads from the files with Autoloader and performs a merge to deduplicate (a rough sketch of this pattern follows the list).
  • Create a Silver (Enriched) Delta Lake table that reads from the Bronze table and joins with another table.
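
For illustration, Option 2 would look roughly like the minimal PySpark sketch below. It assumes a Databricks notebook (ambient spark session) and hypothetical names: a /mnt/raw/sensors source path, a bronze_sensors table that already exists, and sensor_id plus event_time as the dedup key; adjust to your schema.

    from delta.tables import DeltaTable

    def upsert_to_bronze(batch_df, batch_id):
        # Dedupe within the micro-batch first, then merge so rows that
        # already exist in Bronze are not inserted again.
        deduped = batch_df.dropDuplicates(["sensor_id", "event_time"])
        (DeltaTable.forName(spark, "bronze_sensors").alias("t")  # assumed existing table
            .merge(deduped.alias("s"),
                   "t.sensor_id = s.sensor_id AND t.event_time = s.event_time")
            .whenNotMatchedInsertAll()
            .execute())

    (spark.readStream
        .format("cloudFiles")  # Autoloader
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/bronze_sensors/schema")
        .load("/mnt/raw/sensors")
        .writeStream
        .foreachBatch(upsert_to_bronze)
        .option("checkpointLocation", "/mnt/checkpoints/bronze_sensors")
        .start())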

3 REPLIES

-werners-
Esteemed Contributor III

I would use Option 2, for the simple reason that the deduplicated table is in fact raw data, just without duplicates.

Unless you need to keep a Delta Lake table that still contains the duplicates?

If not, Option 1 only creates more overhead: your original data in blob storage, your Bronze table with duplicates, and your Silver table without duplicates.

That is three copies of the same schema; Option 2 reduces it to two.

But again: only if you have no reason to keep the duplicates in a Delta table.

PS: be aware that the merge itself can fail because of duplicates:

A merge operation can fail if multiple rows of the source dataset match and the merge attempts to update the same rows of the target Delta table. According to the SQL semantics of merge, such an update operation is ambiguous as it is unclear which source row should be used to update the matched target row.
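
One way to guard against that is to keep only one source row per merge key before merging. A minimal sketch, assuming hypothetical columns sensor_id and event_time as the merge key and ingested_at as a recency tiebreaker:

    from pyspark.sql import DataFrame, functions as F
    from pyspark.sql.window import Window

    def dedupe_source(updates: DataFrame) -> DataFrame:
        # Keep only the newest row per merge key so the MERGE cannot
        # match multiple source rows to the same target row.
        w = (Window.partitionBy("sensor_id", "event_time")
                   .orderBy(F.col("ingested_at").desc()))
        return (updates
                .withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn"))

Run your source DataFrame through something like this before the merge and the ambiguous-match error cannot occur.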

peter_mcnally
New Contributor III

I am curious what you decided to do, and what you would do now. I have some sensor data I am grabbing with an API based on a date field. The way the API is configured, there may be some overlap in time, and I want to dedupe any such records.

Thanks!

Tharun-Kumar
Honored Contributor II

@peter_mcnally 

You can use watermarking to handle late-arriving records and write only the latest records to the Bronze table. This ensures that your Bronze table always contains the latest information; a sketch follows the link below.

This feature is explained in detail here - https://www.databricks.com/blog/2022/08/22/feature-deep-dive-watermarking-apache-spark-structured-st...
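
A minimal sketch of watermark-based streaming dedup, again with hypothetical paths, an event_time timestamp column, and sensor_id as the key. Spark keeps dedup state per key and drops state older than the watermark, so memory stays bounded:

    deduped = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/dedup/schema")
        .load("/mnt/raw/sensors")
        .withWatermark("event_time", "10 minutes")     # tolerate 10 min of lateness
        .dropDuplicates(["sensor_id", "event_time"]))  # watermark column must be in the key

    (deduped.writeStream
        .option("checkpointLocation", "/mnt/checkpoints/dedup")
        .toTable("bronze_sensors"))

Note that this only deduplicates records arriving within the watermark window; duplicates that arrive later than the allowed lateness will slip through.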
