Hi there, I have what may be a deceptively simple question, but I suspect it has a variety of answers:
- What is the "right" place to handle deduplication in the medallion architecture?
In my setup, everything is already laid out: data arrives in a `landing` location, and I have a DLT job that loops through each source CSV and loads it into its target Delta table. At the moment, the raw CSVs are loaded as-is into a bronze Delta table (DLT streaming), with no deduplication at all. If the same data is sent in two differently timestamped CSVs, *all* of it shows up in bronze.
My current intent is to land all the raw data in bronze and then deduplicate it in a second, silver Delta table (DLT streaming).
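To make the intent concrete, here is a rough sketch of the behavior I'm after, written in plain Python rather than actual DLT code. The column names `id` (a business key) and `file_ts` (a timestamp taken from the source file) are placeholders I made up for illustration, and "keep the latest row per key" is just one possible dedup rule.

```python
# Plain-Python illustration only; not DLT or Spark code.
# "id" and "file_ts" are hypothetical column names.

def to_bronze(batches):
    """Bronze: append every row from every file, with no deduplication."""
    return [row for batch in batches for row in batch]


def to_silver(bronze_rows, key="id", order_by="file_ts"):
    """Silver: keep one row per key, the one with the latest order_by value."""
    latest = {}
    for row in bronze_rows:
        current = latest.get(row[key])
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


# The same two records delivered twice, in differently timestamped CSVs.
csv_a = [
    {"id": 1, "value": "x", "file_ts": "2024-01-01T00:00:00"},
    {"id": 2, "value": "y", "file_ts": "2024-01-01T00:00:00"},
]
csv_b = [
    {"id": 1, "value": "x", "file_ts": "2024-01-02T00:00:00"},
    {"id": 2, "value": "y", "file_ts": "2024-01-02T00:00:00"},
]

bronze = to_bronze([csv_a, csv_b])  # all four rows are kept
silver = to_silver(bronze)          # one row per id
```

So bronze stays a faithful copy of everything that was sent, and silver holds one row per key.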
Does this make sense? I'm curious whether others handle it the same way, or whether it is more common to deduplicate in the bronze table instead.