Hi
What's the best way to deal with SCD2-styled tables in silver and/or gold layer while streaming.
From what I've seen in the Professional Data Engineer videos, they usually go for SCD1 tables (simple updates or deletes)
In a SCD2 scenario, we need to insert a new record (postimage) and "end-date" the old record in the target. Hence, two operations are required.
As of now, I can't see how to implement that in a streaming microbatch (foreachBatch) or CDC-CDF stream. In a "classic" DWH (including Data Vault) this is an extra-step. I guess this is not applicable in a streaming/near-realtime scenario, since we would have two active records until the old one was marked invalid.
So in other words, I wonder how to:
- insert the new record, using the current/processing timestamp as start_ts (easy & tought in videos)
- update the old record's end_ts from the new record's start_ts - 1 of the smallest unit (eg. second or milisecond) in "classic" SQL this could be another MERGE-when-matched using LEAD
Any thoughts?
Since I am "playing around" & learning, you may assume I follow the recommended Medallion architecture. So bronze would be a multi-plexed table with kafka/debezium/json records and the second stream from bronze to silver utilizes deduplication with watermarks and the PII-stuff presented in the training 🙂