Data Engineering
Delta Lake, CDF & SCD2

quakenbush
Contributor

Hi

What's the best way to deal with SCD2-style tables in the silver and/or gold layer while streaming?

From what I've seen in the Professional Data Engineer videos, they usually go for SCD1 tables (simple updates or deletes).

In an SCD2 scenario, we need to insert a new record (postimage) and "end-date" the old record in the target. Hence, two operations are required.

As of now, I can't see how to implement that in a streaming microbatch (foreachBatch) or a CDC/CDF stream. In a "classic" DWH (including Data Vault) this is an extra step. I guess this is not applicable in a streaming/near-real-time scenario, since we would have two active records until the old one was marked invalid.

So in other words, I wonder how to:

  1. insert the new record, using the current/processing timestamp as start_ts (easy & taught in the videos)
  2. update the old record's end_ts to the new record's start_ts minus 1 of the smallest unit (e.g. second or millisecond); in "classic" SQL this could be another MERGE-when-matched using LEAD (see the sketch below)
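
Roughly what I mean, as an untested sketch with made-up names (dim_customer with customer_id, start_ts, end_ts):

  spark.sql("""
      MERGE INTO dim_customer t
      USING (
          -- recompute each version's end_ts from the next version's start_ts
          SELECT customer_id, start_ts,
                 LEAD(start_ts) OVER (PARTITION BY customer_id ORDER BY start_ts)
                     - INTERVAL 1 SECOND AS new_end_ts
          FROM dim_customer
      ) s
      ON t.customer_id = s.customer_id AND t.start_ts = s.start_ts
      WHEN MATCHED AND NOT (t.end_ts <=> s.new_end_ts) THEN
          UPDATE SET end_ts = s.new_end_ts
  """)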

Any thoughts?

Since I am "playing around" & learning, you may assume I follow the recommended Medallion architecture. So bronze would be a multiplexed table with Kafka/Debezium/JSON records, and the second stream from bronze to silver utilizes deduplication with watermarks and the PII stuff presented in the training 🙂

2 REPLIES

Wojciech_BUK
Valued Contributor III

I think you can explore the DLT "APPLY CHANGES" API.
You can run it only in a DLT pipeline, but it can read from a streaming endpoint or a streaming table.
Please check docs:
https://docs.databricks.com/en/delta-live-tables/cdc.html#language-python
You just include this line of code:

  stored_as_scd_type = "2"

and you have your SCD2 logic done 🙂
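
For context, this is roughly what a minimal pipeline looks like (table and column names here are made up, see the docs above for the full API):

  import dlt
  from pyspark.sql.functions import col

  # target streaming table that APPLY CHANGES maintains as SCD2
  dlt.create_streaming_table("customers_silver")

  dlt.apply_changes(
      target = "customers_silver",
      source = "customers_bronze",    # assumed CDC feed, e.g. your Debezium records
      keys = ["customer_id"],         # business key
      sequence_by = col("event_ts"),  # ordering column from the CDC source
      stored_as_scd_type = "2"        # DLT adds __START_AT / __END_AT columns
  )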

But if you want to do classic engineering, you are right, this is a MERGE with

  whenMatchedUpdate

You can do it either in PySpark or Spark SQL, and you can also do it in Delta table streaming.
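
A rough (untested) sketch of that two-step SCD2 MERGE inside foreachBatch - table and column names (silver.customers, customer_id, name, event_ts, start_ts, end_ts) are assumptions. The NULL-key union trick lets a single MERGE both close the old version and insert the new one:

  from delta.tables import DeltaTable

  def upsert_scd2(batch_df, batch_id):
      # assumes batch_df holds at most one row per customer_id (deduplicated
      # upstream) and that customer_id is a string
      target = DeltaTable.forName(batch_df.sparkSession, "silver.customers")
      current = target.toDF().where("end_ts IS NULL")

      # rows whose key already has an open version: staged with a NULL merge_key
      # so they never match and fall through to INSERT as the new version
      # (no change detection here, for brevity)
      new_versions = (
          batch_df.alias("u")
          .join(current.alias("c"), "customer_id")
          .selectExpr("CAST(NULL AS STRING) AS merge_key", "u.*")
      )

      # every incoming row keyed by the business key: closes the open version
      # (WHEN MATCHED) or inserts brand-new keys (WHEN NOT MATCHED)
      keyed = batch_df.selectExpr("customer_id AS merge_key", "*")

      (target.alias("t")
          .merge(keyed.unionByName(new_versions).alias("s"),
                 "t.customer_id = s.merge_key AND t.end_ts IS NULL")
          .whenMatchedUpdate(set={"end_ts": "s.event_ts - INTERVAL 1 SECOND"})
          .whenNotMatchedInsert(values={
              "customer_id": "s.customer_id",
              "name": "s.name",
              "start_ts": "s.event_ts",
              "end_ts": "CAST(NULL AS TIMESTAMP)",
          })
          .execute())

  # wired into the bronze-to-silver stream, e.g.:
  # deduped_stream.writeStream.foreachBatch(upsert_scd2) \
  #     .option("checkpointLocation", checkpoint_path).start()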

Please let me know if that helped, or maybe I misunderstood your question.


quakenbush
Contributor

I did some further reading and came to the same conclusion. APPLY CHANGES might do the trick. However, I don't like the limitations. From Bronze to Silver I might need .foreachBatch to implement the JSON logic, and the attribute names (__start_at / __end_at) seem to be static - this can make migration of BI layers (gold) hard.
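
One workaround I'm considering: decouple the gold layer from those fixed names with a simple view (names made up):

  spark.sql("""
      CREATE OR REPLACE VIEW gold.dim_customer AS
      SELECT customer_id,
             name,
             __start_at AS valid_from,  -- map DLT's fixed SCD2 columns
             __end_at   AS valid_to     -- to our own BI naming
      FROM silver.customers_silver
  """)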

Anyway, I think you answered my question by confirming it. Since I'm doing this for learning & fun, I'd like to develop my own "Databricks-native Data Vault 2.0" loader with respect to PII handling, as taught in the training 😉
