topic Re: Delta Lake, CFD & SCD2 in Data Engineering

Delta Lake, CFD & SCD2

quakenbush — Thu, 21 Dec 2023 13:07:16 GMT

What's the best way to deal with SCD2-styled tables in silver and/or gold layer while streaming.

From what I've seen in the Professional Data Engineer videos, they usually go for SCD1 tables (simple updates or deletes)

In a SCD2 scenario, we need to insert a new record (postimage) and "end-date" the old record in the target. Hence, two operations are required.

As of now, I can't see how to implement that in a streaming microbatch (foreachBatch) or CDC-CDF stream. In a "classic" DWH (including Data Vault) this is an extra-step. I guess this is not applicable in a streaming/near-realtime scenario, since we would have two active records until the old one was marked invalid.

So in other words, I wonder how to:

insert the new record, using the current/processing timestamp as start_ts (easy & tought in videos)
update the old record's end_ts from the new record's start_ts - 1 of the smallest unit (eg. second or milisecond) in "classic" SQL this could be another MERGE-when-matched using LEAD

Any thoughts?

Since I am "playing around" & learning, you may assume I follow the recommended Medallion architecture. So bronze would be a multi-plexed table with kafka/debezium/json records and the second stream from bronze to silver utilizes deduplication with watermarks and the PII-stuff presented in the training 🙂

Re: Delta Lake, CFD & SCD2

Wojciech_BUK — Thu, 21 Dec 2023 22:51:48 GMT

I think you can explore DLT API "Apply Changes".
You can run it only in DLT pipeline but it can read from streming endpoint or streaming table.
Please check docs:
https://docs.databricks.com/en/delta-live-tables/cdc.html#language-python
You just include this line of code:

stored_as_scd_type = "2"

and you have your SCD2 logic done 🙂

But if you want to do clasic engineering you are right, this is MERGE with

whenMatchedUpdate

You can do it either in pysaprk or spraksql and you can do it also in delta table straming.

Please let me know if that helped or maybe I missunderstood your question.

Re: Delta Lake, CFD & SCD2

quakenbush — Fri, 22 Dec 2023 08:10:05 GMT

I did some further reading and got the same conclusion. APPLY CHANGES might to the trick. However, I don't like the limitations. From Bronze to Silver I might need .foreachBatch to implement the JSON-logic and the attribute names (__start_at / __end_at) seem to be static - this can make migrations of BI-layers (gold) hard

Anyway, I think you answered my question by confirmation. Since I'm doing this for learning & fun, I'd like to develop my own "Databricks-native Datavault 2.0"-Loader with respect of PII-handling as thought in the training 😉