Delta Lake, CDF & SCD2

quakenbush
Contributor

Hi

What's the best way to deal with SCD2-style tables in the silver and/or gold layer while streaming?

From what I've seen in the Professional Data Engineer videos, they usually go for SCD1 tables (simple updates or deletes).

In an SCD2 scenario, we need to insert a new record (post-image) and "end-date" the old record in the target. Hence, two operations are required.

As of now, I can't see how to implement that in a streaming micro-batch (foreachBatch) or CDC/CDF stream. In a "classic" DWH (including Data Vault) this is an extra step. I guess this is not applicable in a streaming/near-real-time scenario, since we would have two active records until the old one was marked invalid.

So in other words, I wonder how to:

  1. insert the new record, using the current/processing timestamp as start_ts (easy, and taught in the videos)
  2. update the old record's end_ts to the new record's start_ts minus one smallest unit (e.g. a second or millisecond); in "classic" SQL this could be another MERGE-when-matched using LEAD (rough sketch below)
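
To illustrate step 2, a rough PySpark sketch of that LEAD-based end-dating (table and column names are made up):

  from pyspark.sql import functions as F, Window

  # Derive each version's end_ts as the next version's start_ts minus 1 ms,
  # per business key; a NULL end_ts marks the current record.
  w = Window.partitionBy("customer_id").orderBy("start_ts")
  end_dated = (spark.table("silver.customers_scd2")
      .withColumn("next_start", F.lead("start_ts").over(w))
      .withColumn("end_ts", F.expr("next_start - INTERVAL 1 MILLISECOND"))
      .drop("next_start"))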

Any thoughts?

Since I am "playing around" & learning, you may assume I follow the recommended medallion architecture. So bronze would be a multiplexed table with Kafka/Debezium JSON records, and the second stream from bronze to silver uses deduplication with watermarks plus the PII handling presented in the training 🙂

2 REPLIES

Wojciech_BUK
Contributor III

I think you can explore the DLT API "Apply Changes".
You can run it only in a DLT pipeline, but it can read from a streaming source or a streaming table.
Please check the docs:
https://docs.databricks.com/en/delta-live-tables/cdc.html#language-python
You just include this line of code:

  stored_as_scd_type = "2"

and you have your SCD2 logic done 🙂
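
For context, a minimal sketch of how that could look in a DLT pipeline (table and column names such as customers_bronze, customer_id and event_ts are placeholders, adjust to your schema):

  import dlt
  from pyspark.sql.functions import col

  # Hypothetical bronze CDC feed; replace names with your own.
  @dlt.view
  def customers_bronze():
      return spark.readStream.table("bronze.customers")

  dlt.create_streaming_table("customers_silver")

  dlt.apply_changes(
      target = "customers_silver",
      source = "customers_bronze",
      keys = ["customer_id"],           # business key
      sequence_by = col("event_ts"),    # ordering column of the CDC feed
      stored_as_scd_type = "2"          # keep full history (__START_AT/__END_AT)
  )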

But if you want to do classic engineering, you are right: this is a MERGE with whenMatchedUpdate. You can do it either in PySpark or Spark SQL, and you can also do it in Delta table streaming.
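
Roughly like this in foreachBatch, following the SCD2 MERGE pattern from the Delta Lake docs (all names, silver.customers_scd2, bronze.customers, customer_id, event_ts, payload, are placeholders):

  from delta.tables import DeltaTable
  from pyspark.sql import functions as F

  def upsert_scd2(batch_df, batch_id):
      target = DeltaTable.forName(spark, "silver.customers_scd2")

      # New versions of keys that already have an active record get a NULL
      # merge key, so the MERGE cannot match them and they fall through to
      # the insert branch (your step 1).
      current = target.toDF().where("is_current = true")
      new_versions = (batch_df.alias("u")
          .join(current.alias("c"), "customer_id")
          .select("u.*")
          .withColumn("merge_key", F.lit(None).cast("string")))  # cast to the key's type

      # Every incoming row also goes in under its real key: it either closes
      # the active record (matched, your step 2) or inserts a brand-new key.
      keyed = batch_df.withColumn("merge_key", F.col("customer_id"))
      staged = new_versions.unionByName(keyed)

      (target.alias("t")
          .merge(staged.alias("s"),
                 "t.customer_id = s.merge_key AND t.is_current = true")
          # Step 2: end-date the old record just before the new start_ts.
          .whenMatchedUpdate(set = {
              "is_current": "false",
              "end_ts": "s.event_ts - INTERVAL 1 MILLISECOND"})
          # Step 1: insert the post-image as the new active record.
          .whenNotMatchedInsert(values = {
              "customer_id": "s.customer_id",
              "payload": "s.payload",
              "start_ts": "s.event_ts",
              "end_ts": "CAST(NULL AS TIMESTAMP)",
              "is_current": "true"})
          .execute())

  (spark.readStream.table("bronze.customers")
      .writeStream
      .foreachBatch(upsert_scd2)
      .option("checkpointLocation", "/tmp/chk/customers_scd2")  # placeholder path
      .start())

Note the sketch assumes at most one change per key per micro-batch; otherwise deduplicate inside the batch first and keep only the latest record per key.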

Please let me know if that helped, or whether I misunderstood your question.


quakenbush
Contributor

I did some further reading and came to the same conclusion: APPLY CHANGES might do the trick. However, I don't like the limitations. From bronze to silver I might still need .foreachBatch to implement the JSON logic, and the attribute names (__start_at / __end_at) seem to be static, which can make migrations of BI layers (gold) hard.
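
If I do end up on APPLY CHANGES, I could probably hide those fixed names behind a gold view; a rough sketch with made-up table/column names:

  # Rename the fixed SCD2 columns behind a gold view so the BI layer
  # sees stable, self-chosen names.
  spark.sql("""
      CREATE OR REPLACE VIEW gold.customers AS
      SELECT customer_id,
             payload,
             __start_at AS valid_from,
             __end_at   AS valid_to
      FROM silver.customers_scd2
  """)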

Anyway, I think you answered my question by confirming it. Since I'm doing this for learning & fun, I'd like to develop my own "Databricks-native Data Vault 2.0" loader, respecting the PII handling as taught in the training 😉
