10-18-2021 03:58 AM
Need some help choosing where to do deduplication of data. I have sensor data in blob storage that I'm picking up with Databricks Autoloader. The data and the files can contain duplicates.
Which of the two options should I choose?
Option 1: Stream the raw data into a bronze Delta table as-is (duplicates included), then deduplicate from bronze into a silver table.
Option 2: Deduplicate during ingestion (e.g. with a MERGE), so the first Delta table already contains no duplicates.
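Roughly what I mean by Option 1, as a sketch (the file format, paths, and the sensor_id / event_ts key columns are placeholders, not my real schema):

```python
# Option 1: land everything in a bronze table first, dedupe into silver after.
raw = (spark.readStream.format("cloudFiles")
       .option("cloudFiles.format", "json")                 # placeholder format
       .option("cloudFiles.schemaLocation", "/chk/schema")  # placeholder path
       .load("/mnt/sensor-landing/"))                       # placeholder path

# Bronze: raw data, duplicates and all.
(raw.writeStream
    .option("checkpointLocation", "/chk/bronze")
    .trigger(once=True)
    .toTable("sensor_bronze")
    .awaitTermination())

# Silver: same rows, deduplicated on a natural key.
(spark.read.table("sensor_bronze")
      .dropDuplicates(["sensor_id", "event_ts"])
      .write.mode("overwrite")
      .saveAsTable("sensor_silver"))
```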
10-18-2021 05:04 AM
I would use Option 2, for the simple reason that the deduped table is in fact raw data, just without the duplicates.
Unless you need a Delta Lake table that still contains the duplicates?
If not, you only create more overhead: your original data in blob storage, your bronze table with dups, and your silver table without dups.
So you have 3 copies of the same data. Option 2 reduces this to 2.
But again: unless you have a reason to keep the dups in a Delta table.
PS. be aware that the merge itself can fail because of duplicates:
A merge operation can fail if multiple rows of the source dataset match and the merge attempts to update the same rows of the target Delta table. According to the SQL semantics of merge, such an update operation is ambiguous as it is unclear which source row should be used to update the matched target row.
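The usual workaround, if you go the merge route, is to drop duplicates from each micro-batch before merging, so at most one source row can match any given target row. A minimal sketch with foreachBatch (the table name, paths, and key columns are assumptions, not from the original post):

```python
from delta.tables import DeltaTable

def upsert_batch(microbatch_df, batch_id):
    # Dedupe the micro-batch on the natural key first, so no two
    # source rows can attempt to update the same target row.
    deduped = microbatch_df.dropDuplicates(["sensor_id", "event_ts"])
    target = DeltaTable.forName(spark, "sensor_events")  # assumed target table
    (target.alias("t")
           .merge(deduped.alias("s"),
                  "t.sensor_id = s.sensor_id AND t.event_ts = s.event_ts")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

raw = (spark.readStream.format("cloudFiles")
       .option("cloudFiles.format", "json")                 # assumed format
       .option("cloudFiles.schemaLocation", "/chk/schema")  # assumed path
       .load("/mnt/sensor-landing/"))                       # assumed path

(raw.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/chk/sensor_events")
    .start())
```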
07-17-2023 04:05 PM
I am curious what you decided to do, and what you would do now. I have some sensor data I am grabbing with an API based on a date field. The way the API is configured, the time ranges can overlap, so I want to dedup any such records.
Thanks!
07-17-2023 08:25 PM
You can use a watermark to handle late records and write only the latest, deduplicated records to the bronze table. This ensures that you always have the latest information in your bronze table.
This feature is explained in detail here - https://www.databricks.com/blog/2022/08/22/feature-deep-dive-watermarking-apache-spark-structured-st...
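For example, a minimal sketch of watermark-based streaming dedup (the 10-minute lateness tolerance, paths, and column names are assumptions):

```python
raw = (spark.readStream.format("cloudFiles")
       .option("cloudFiles.format", "json")                 # assumed format
       .option("cloudFiles.schemaLocation", "/chk/schema")  # assumed path
       .load("/mnt/sensor-landing/"))                       # assumed path

# Keep dedup state only for the last 10 minutes of event time; records
# arriving later than the watermark are treated as too late and dropped.
deduped = (raw
           .withWatermark("event_ts", "10 minutes")
           # the event-time column must be part of the dedup key
           # for the watermark to clean up old state
           .dropDuplicates(["sensor_id", "event_ts"]))

(deduped.writeStream
        .option("checkpointLocation", "/chk/bronze")
        .toTable("sensor_bronze"))
```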