Data Engineering
Delta Lake, CDF & SCD2

quakenbush
Contributor

Hi

What's the best way to deal with SCD2-style tables in the silver and/or gold layer while streaming?

From what I've seen in the Professional Data Engineer videos, they usually go for SCD1 tables (simple updates or deletes).

In an SCD2 scenario, we need to insert a new record (postimage) and "end-date" the old record in the target. Hence, two operations are required.

As of now, I can't see how to implement that in a streaming microbatch (foreachBatch) or a CDC/CDF stream. In a "classic" DWH (including Data Vault) this is an extra step. I guess this is not applicable in a streaming/near-real-time scenario, since we would have two active records until the old one was marked invalid.

So in other words, I wonder how to:

  1. insert the new record, using the current/processing timestamp as start_ts (easy & taught in the videos)
  2. update the old record's end_ts to the new record's start_ts minus 1 of the smallest unit (e.g. second or millisecond); in "classic" SQL this could be another MERGE-when-matched using LEAD (see the sketch below)
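
Roughly what I mean, as an untested sketch with made-up names (dim_customer with customer_id, start_ts, end_ts):

  spark.sql("""
      MERGE INTO dim_customer t
      USING (
          -- recompute each version's end_ts from the next version's start_ts
          SELECT customer_id, start_ts,
                 LEAD(start_ts) OVER (PARTITION BY customer_id ORDER BY start_ts)
                     - INTERVAL 1 SECOND AS new_end_ts
          FROM dim_customer
      ) s
      ON t.customer_id = s.customer_id AND t.start_ts = s.start_ts
      WHEN MATCHED AND NOT (t.end_ts <=> s.new_end_ts) THEN
          UPDATE SET end_ts = s.new_end_ts
  """)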

Any thoughts?

Since I am "playing around" & learning, you may assume I follow the recommended Medallion architecture. So bronze would be a multiplexed table with Kafka/Debezium/JSON records, and the second stream from bronze to silver utilizes deduplication with watermarks and the PII stuff presented in the training 🙂

2 REPLIES

Wojciech_BUK
Valued Contributor III

I think you can explore the DLT "APPLY CHANGES" API.
You can run it only in a DLT pipeline, but it can read from a streaming endpoint or a streaming table.
Please check docs:
https://docs.databricks.com/en/delta-live-tables/cdc.html#language-python
You just include this line of code:

  stored_as_scd_type = "2"

and you have your SCD2 logic done 🙂
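
For context, this is roughly what a minimal pipeline looks like (table and column names here are made up, see the docs above for the full API):

  import dlt
  from pyspark.sql.functions import col

  # target streaming table that APPLY CHANGES maintains as SCD2
  dlt.create_streaming_table("customers_silver")

  dlt.apply_changes(
      target = "customers_silver",
      source = "customers_bronze",    # assumed CDC feed, e.g. your Debezium records
      keys = ["customer_id"],         # business key
      sequence_by = col("event_ts"),  # ordering column from the CDC source
      stored_as_scd_type = "2"        # DLT adds __START_AT / __END_AT columns
  )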

But if you want to do classic engineering, you are right, this is a MERGE with

  whenMatchedUpdate

You can do it either in PySpark or Spark SQL, and you can also do it in Delta table streaming.
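
A rough (untested) sketch of that two-step SCD2 MERGE inside foreachBatch - table and column names (silver.customers, customer_id, name, event_ts, start_ts, end_ts) are assumptions. The NULL-key union trick lets a single MERGE both close the old version and insert the new one:

  from delta.tables import DeltaTable

  def upsert_scd2(batch_df, batch_id):
      # assumes batch_df holds at most one row per customer_id (deduplicated
      # upstream) and that customer_id is a string
      target = DeltaTable.forName(batch_df.sparkSession, "silver.customers")
      current = target.toDF().where("end_ts IS NULL")

      # rows whose key already has an open version: staged with a NULL merge_key
      # so they never match and fall through to INSERT as the new version
      # (no change detection here, for brevity)
      new_versions = (
          batch_df.alias("u")
          .join(current.alias("c"), "customer_id")
          .selectExpr("CAST(NULL AS STRING) AS merge_key", "u.*")
      )

      # every incoming row keyed by the business key: closes the open version
      # (WHEN MATCHED) or inserts brand-new keys (WHEN NOT MATCHED)
      keyed = batch_df.selectExpr("customer_id AS merge_key", "*")

      (target.alias("t")
          .merge(keyed.unionByName(new_versions).alias("s"),
                 "t.customer_id = s.merge_key AND t.end_ts IS NULL")
          .whenMatchedUpdate(set={"end_ts": "s.event_ts - INTERVAL 1 SECOND"})
          .whenNotMatchedInsert(values={
              "customer_id": "s.customer_id",
              "name": "s.name",
              "start_ts": "s.event_ts",
              "end_ts": "CAST(NULL AS TIMESTAMP)",
          })
          .execute())

  # wired into the bronze-to-silver stream, e.g.:
  # deduped_stream.writeStream.foreachBatch(upsert_scd2) \
  #     .option("checkpointLocation", checkpoint_path).start()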

Please let me know if that helped, or maybe I misunderstood your question.


quakenbush
Contributor

I did some further reading and came to the same conclusion. APPLY CHANGES might do the trick. However, I don't like the limitations. From Bronze to Silver I might need .foreachBatch to implement the JSON logic, and the attribute names (__start_at / __end_at) seem to be static - this can make migration of BI layers (gold) hard.
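
One workaround I'm considering: decouple the gold layer from those fixed names with a simple view (names made up):

  spark.sql("""
      CREATE OR REPLACE VIEW gold.dim_customer AS
      SELECT customer_id,
             name,
             __start_at AS valid_from,  -- map DLT's fixed SCD2 columns
             __end_at   AS valid_to     -- to our own BI naming
      FROM silver.customers_silver
  """)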

Anyway, I think you answered my question by confirming it. Since I'm doing this for learning & fun, I'd like to develop my own "Databricks-native Data Vault 2.0" loader with respect to PII handling, as taught in the training 😉
