Deduplication, Bronze (raw) or Silver (enriched)

baatchus
New Contributor III

Need some help choosing where to deduplicate data. I have sensor data in blob storage that I'm picking up with Databricks Autoloader. The data and files can contain duplicates.

Which of these two options should I choose?

Option 1:

  • Create a Bronze (Raw) Delta Lake table that reads from the files with Autoloader and only appends data.
  • Create a Silver (Enriched) Delta Lake table that reads from the Bronze table and performs a merge to deduplicate.
  • Create a second Silver (Enriched) Delta Lake table that reads from the first Silver table and joins with another table.

Option 2:

  • Create a Bronze (Raw) Delta Lake table that reads from the files with Autoloader and performs a merge to deduplicate (a rough sketch of this pattern follows the list).
  • Create a Silver (Enriched) Delta Lake table that reads from the Bronze table and joins with another table.
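
For illustration, Option 2 would look roughly like the minimal PySpark sketch below. It assumes a Databricks notebook (ambient spark session) and hypothetical names: a /mnt/raw/sensors source path, a bronze_sensors table that already exists, and sensor_id plus event_time as the dedup key; adjust to your schema.

    from delta.tables import DeltaTable

    def upsert_to_bronze(batch_df, batch_id):
        # Dedupe within the micro-batch first, then merge so rows that
        # already exist in Bronze are not inserted again.
        deduped = batch_df.dropDuplicates(["sensor_id", "event_time"])
        (DeltaTable.forName(spark, "bronze_sensors").alias("t")  # assumed existing table
            .merge(deduped.alias("s"),
                   "t.sensor_id = s.sensor_id AND t.event_time = s.event_time")
            .whenNotMatchedInsertAll()
            .execute())

    (spark.readStream
        .format("cloudFiles")  # Autoloader
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/bronze_sensors/schema")
        .load("/mnt/raw/sensors")
        .writeStream
        .foreachBatch(upsert_to_bronze)
        .option("checkpointLocation", "/mnt/checkpoints/bronze_sensors")
        .start())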

3 REPLIES

-werners-
Esteemed Contributor III

I would use Option 2, for the simple reason that the deduplicated table is in fact raw data, just without duplicates.

Unless you need to keep a Delta Lake table that still contains the duplicates?

If not, Option 1 only creates more overhead: your original data in blob storage, your Bronze table with duplicates, and your Silver table without duplicates.

That is three copies of the same schema; Option 2 reduces it to two.

But again: only if you have no reason to keep the duplicates in a Delta table.

PS: be aware that the merge itself can fail because of duplicates:

A merge operation can fail if multiple rows of the source dataset match and the merge attempts to update the same rows of the target Delta table. According to the SQL semantics of merge, such an update operation is ambiguous as it is unclear which source row should be used to update the matched target row.
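
One way to guard against that is to keep only one source row per merge key before merging. A minimal sketch, assuming hypothetical columns sensor_id and event_time as the merge key and ingested_at as a recency tiebreaker:

    from pyspark.sql import DataFrame, functions as F
    from pyspark.sql.window import Window

    def dedupe_source(updates: DataFrame) -> DataFrame:
        # Keep only the newest row per merge key so the MERGE cannot
        # match multiple source rows to the same target row.
        w = (Window.partitionBy("sensor_id", "event_time")
                   .orderBy(F.col("ingested_at").desc()))
        return (updates
                .withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn"))

Run your source DataFrame through something like this before the merge and the ambiguous-match error cannot occur.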

peter_mcnally
New Contributor III

I am curious what you decided to do, and what you would do now. I have some sensor data I am grabbing with an API based on a date field. The way the API is configured, there may be some overlap in time, and I want to dedupe any such records.

Thanks!

Tharun-Kumar
Honored Contributor II

@peter_mcnally 

You can use watermarking to handle late-arriving records and write only the latest records to the Bronze table. This ensures that your Bronze table always contains the latest information; a sketch follows the link below.

This feature is explained in detail here - https://www.databricks.com/blog/2022/08/22/feature-deep-dive-watermarking-apache-spark-structured-st...
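
A minimal sketch of watermark-based streaming dedup, again with hypothetical paths, an event_time timestamp column, and sensor_id as the key. Spark keeps dedup state per key and drops state older than the watermark, so memory stays bounded:

    deduped = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/dedup/schema")
        .load("/mnt/raw/sensors")
        .withWatermark("event_time", "10 minutes")     # tolerate 10 min of lateness
        .dropDuplicates(["sensor_id", "event_time"]))  # watermark column must be in the key

    (deduped.writeStream
        .option("checkpointLocation", "/mnt/checkpoints/dedup")
        .toTable("bronze_sensors"))

Note that this only deduplicates records arriving within the watermark window; duplicates that arrive later than the allowed lateness will slip through.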
