Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DLT SCD type 2 in bronze, silver and gold? Is it possible?

New Contributor II

I have a question related to using #DLT. Let me try to describe the DLT problem:


  • Process data end-to-end (bronze, silver, gold) using DLT
  • Want bronze to hold a complete raw replica of the data, leveraging apply_changes SCD to write new data
  • Want silver to hold a complete state of the curated data, leveraging apply_changes SCD to write new data
  • Want gold to hold a complete state of the use case specific data, leveraging apply_changes SCD to write new data
  • Use auto loader to load data from landing into bronze
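For reference, the bronze part of the intent above could be sketched roughly like this in a DLT pipeline (a hedged sketch only: table names, the landing path, the key column, and the event_ts sequencing column are made-up placeholders):

```python
import dlt
from pyspark.sql.functions import col

# Landing -> raw feed with Auto Loader (append-only ingest).
# Path, format and column names are hypothetical.
@dlt.view(name="customers_raw")
def customers_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/customers")
    )

# Bronze as a complete SCD type 2 history via apply_changes.
dlt.create_streaming_table("customers_bronze")

dlt.apply_changes(
    target="customers_bronze",
    source="customers_raw",
    keys=["customer_id"],
    sequence_by=col("event_ts"),
    stored_as_scd_type=2,
)
```

Note that `spark` here is the session DLT provides; this fragment only runs inside a DLT pipeline, not as a standalone script.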


The problem: it does not seem possible to use apply_changes (e.g., SCD type 2) with DLT in several data layers (bronze, silver, and gold) within the same pipeline.

If I use SCD in bronze, for instance, I cannot use it again in silver or gold.


  • skipChangeCommits needs to be used to be able to stream from a table created with apply_changes; otherwise the job crashes, because streaming from a table that has changes is not allowed.
  • skipChangeCommits ignores every form of change (deletes, updates, even newly created rows with the same keys as existing old rows)
    • It would be understandable if skipChangeCommits ignored SCD type 1 changes, where the row itself is updated in place, but not SCD type 2, where new actual rows are created
  • I managed to work around this by streaming from, e.g., bronze (written with apply_changes) into a silver table using a separate DLT pipeline (i.e., one pipeline for bronze, one for silver). I could then use ignoreChanges instead of skipChangeCommits and stream from bronze as a normal Delta table (not a live table). Both tables (bronze and silver) then update as they should, and both use apply_changes SCD type 2.
    • This works. However, it requires multiple DLT pipelines, one for each layer and/or every time one wants an SCD table as input…
    • Interestingly, recent Databricks documentation suggests using skipChangeCommits instead of ignoreChanges.
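The silver side of that two-pipeline workaround could look roughly like this (a sketch under assumptions: table names are hypothetical, and __START_AT is assumed to be the sequencing column that apply_changes writes into SCD type 2 targets):

```python
import dlt

# In a *separate* pipeline, read the bronze SCD2 table as a plain
# Delta source (not LIVE.), skipping update/delete commits.
# Catalog/schema/table names are hypothetical.
@dlt.view(name="bronze_feed")
def bronze_feed():
    return (
        spark.readStream
        .option("skipChangeCommits", "true")
        .table("my_catalog.my_schema.customers_bronze")
    )

# Silver as its own SCD type 2 target fed from the bronze stream.
dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",
    source="bronze_feed",
    keys=["customer_id"],
    sequence_by="__START_AT",
    stored_as_scd_type=2,
)
```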

If one cannot use SCD type 2 in multiple layers, then all upstream tables must be append-only streaming tables, which is not very appealing, especially when trying to build a data warehouse where consumers can choose which layer to read from without having to re-implement the processing.

I would appreciate guidance on whether this is simply not supported, or on how one is supposed to incrementally load data into all tables end-to-end (bronze, silver, gold).


New Contributor III


I'm no expert in DLT, and I've only been using it for a few months. My reading of the docs so far suggests streaming all your raw data into bronze using Auto Loader, and then using apply_changes (SCD1/2) into silver.

Finally, in the gold layer, use materialized views; as you correctly pointed out, you can't use an apply_changes target as a source for another SCD1/2 target. DLT is quite opinionated in its approach, which is probably not a bad thing.
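A gold-layer materialized view over an SCD2 silver table could be sketched like this in Python (hypothetical names; this assumes the __END_AT column that apply_changes adds to SCD type 2 targets, where NULL marks the currently active row):

```python
import dlt

# A batch read (spark.read, not readStream) makes this a
# materialized view that DLT recomputes/maintains for you.
@dlt.table(name="customers_gold_current")
def customers_gold_current():
    return (
        spark.read.table("LIVE.customers_silver")
        .where("__END_AT IS NULL")  # only currently active SCD2 rows
    )
```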

For everything else, there is always structured streaming, CDC with foreachBatch(), etc.
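As one example of that escape hatch, a structured streaming job outside DLT could MERGE each micro-batch into a gold Delta table via foreachBatch() (a sketch only; table names, the key column, and the checkpoint path are made up):

```python
from delta.tables import DeltaTable

# Upsert each micro-batch into the gold table with a Delta MERGE.
def upsert_to_gold(batch_df, batch_id):
    gold = DeltaTable.forName(
        batch_df.sparkSession, "my_catalog.my_schema.customers_gold"
    )
    (gold.alias("t")
         .merge(batch_df.alias("s"), "t.customer_id = s.customer_id")
         .whenMatchedUpdateAll()
         .whenNotMatchedInsertAll()
         .execute())

(spark.readStream
      .option("skipChangeCommits", "true")
      .table("my_catalog.my_schema.customers_silver")
      .writeStream
      .foreachBatch(upsert_to_gold)
      .option("checkpointLocation", "/tmp/checkpoints/customers_gold")
      .start())
```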




New Contributor II

Hi Toby,

Thanks for your reply! Much appreciated and interesting to hear your view.

In terms of streaming to bronze without CDC: it would essentially be an append-only stream, right? What about a dimension table that processes delta data, where you want to update the state of the table incrementally? With append-only, it could be a bit clunky, as it would be harder to query that bronze table's latest state (without using CDC). Hence, wouldn't it be preferable to maintain an easily accessible, complete latest state in the bronze layer too, rather than an append-only bronze table?
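To make the "clunky to query" point concrete: with an append-only bronze, every consumer has to deduplicate by key on the highest sequence value before it can see the current state. A toy pure-Python illustration of that deduplication (in a real pipeline this would be a Spark window or max-by query; the data is made up):

```python
# Append-only change feed: multiple rows per key, newest wins.
rows = [
    {"id": 1, "name": "Ada",    "seq": 1},
    {"id": 2, "name": "Bob",    "seq": 2},
    {"id": 1, "name": "Ada L.", "seq": 3},  # later update for id=1
]

# Keep only the row with the highest seq per key.
latest = {}
for r in rows:
    if r["id"] not in latest or r["seq"] > latest[r["id"]]["seq"]:
        latest[r["id"]] = r

current_state = sorted(latest.values(), key=lambda r: r["id"])
print(current_state)
# [{'id': 1, 'name': 'Ada L.', 'seq': 3}, {'id': 2, 'name': 'Bob', 'seq': 2}]
```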

Thinking on the gold layer:

1. If you use foreachBatch(), it would be done outside of the DLT pipeline, if I am not wrong? Essentially run as a separate Databricks job. I'd say it would be more efficient to rely on the same technology (DLT) throughout, if possible.

2. Materialised views are a bit limited in their current support: e.g., they require Unity Catalog, are intended for Databricks-managed queries, don't support surrogate keys (which can be nice for reporting), and seem to be SQL-only currently.

What are your thoughts?

Best regards,


