Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DLT SCD type 2 in bronze, silver and gold? Is it possible?

andreasmherman
New Contributor II

I have a question related to using #DLT. Let me try to describe the problem:

Objective: 

  • Process data end-to-end (bronze, silver, gold) using DLT (the intended design is sketched after this list)
  • Want bronze to hold a complete raw replica of the data, leveraging apply_changes SCD to write new data
  • Want silver to hold a complete state of the curated data, leveraging apply_changes SCD to write new data
  • Want gold to hold a complete state of the use-case-specific data, leveraging apply_changes SCD to write new data
  • Use Auto Loader to load data from landing into bronze
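
For concreteness, a minimal sketch of the intended design for one table, assuming a CDC-style feed with a customer_id key and an event_ts sequencing column (all names and paths here are placeholders):

```python
import dlt
from pyspark.sql import functions as F

# Landing -> bronze: Auto Loader reads the raw feed as a stream.
# Path, file format, and column names are placeholder assumptions.
@dlt.view(name="landing_feed")
def landing_feed():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/landing/source_system/customers"))

# Bronze holds the complete raw history as SCD type 2.
dlt.create_streaming_table("bronze_customers")
dlt.apply_changes(
    target="bronze_customers",
    source="landing_feed",
    keys=["customer_id"],
    sequence_by=F.col("event_ts"),
    stored_as_scd_type=2,
)

# The intent was to repeat the same apply_changes pattern from bronze
# into silver and from silver into gold -- which is exactly where the
# "cannot stream from an apply_changes target" problem appears.
```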

Problem:

It is not possible to chain apply_changes (e.g., SCD type 2) across several data layers (bronze, silver, gold) in a single DLT pipeline.

If I use SCD in bronze, for instance, I cannot use it in silver or gold.

Reason:

  • skipChangeCommits needs to be used to be able to stream from a table created with apply_changes; otherwise the job crashes because streaming from a changed table is not allowed.
  • skipChangeCommits ignores every form of change (deletes, updates, even newly created rows with the same keys as existing old rows)
    • It would be understandable if skipChangeCommits ignored SCD type 1 changes, where the row itself is updated in place, but not SCD type 2's changes, where actual new rows are created
  • I managed to "go around this" by streaming from e.g., bronze (written with apply_changes) into a silver table with a separate DLT pipeline (i.e., one for bronze, one for silver). Then I could use ignoreChanges instead of skipChangeCommits and stream from bronze as a normal Delta table (not a live table). Then both tables (bronze and silver) update as they should, and both use apply_changes SCD type 2 (see the sketch after this list).
    • This is good. However, it requires multiple DLT pipelines, one per layer and/or every time one wants an SCD table as input…
    • Recent Databricks documentation suggests using skipChangeCommits instead of ignoreChanges, which is interesting.
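
A minimal sketch of that workaround, i.e. a separate silver pipeline that reads bronze as a plain Delta table rather than a live table (catalog, table, and column names are placeholders):

```python
import dlt
from pyspark.sql import functions as F

# Separate silver DLT pipeline. Bronze is addressed by its fully
# qualified name (not LIVE.*), so ignoreChanges can be applied.
@dlt.view(name="bronze_feed")
def bronze_feed():
    return (spark.readStream
            .option("ignoreChanges", "true")  # tolerate apply_changes rewrites upstream
            .table("main.bronze.bronze_customers"))

dlt.create_streaming_table("silver_customers")
dlt.apply_changes(
    target="silver_customers",
    source="bronze_feed",
    keys=["customer_id"],
    sequence_by=F.col("event_ts"),
    stored_as_scd_type=2,
)
```

Note that ignoreChanges can re-deliver rows from rewritten data files, so downstream logic needs to tolerate duplicates.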

If one cannot use SCD type 2 in multiple layers, then all upstream tables must be "append only" streaming tables, which is not too appealing, especially when trying to build a data warehouse where one can choose what data to consume from which layer without having to re-implement processing.

I would appreciate guidance on whether this is supported, and on how one is supposed to incrementally load data into all tables end-to-end (bronze, silver, gold).

4 REPLIES

thedatacrew
New Contributor III

Hi,

I'm no expert in DLT, and I've only been using it for a few months. My reading of the docs so far suggests streaming all your raw data into bronze using Auto Loader, then using apply_changes (SCD1/2) into silver.

Finally, in the gold layer, use materialized views; as you correctly pointed out, you can't use an apply_changes target as a source for another SCD1/2 target. DLT is quite opinionated in its approach, which is probably not a bad thing.
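
In code, the pattern I mean looks roughly like this, assuming an order_id key and a loaded_at sequencing column (names and paths are illustrative):

```python
import dlt
from pyspark.sql import functions as F

# Bronze: append-only raw ingest via Auto Loader.
@dlt.table(name="bronze_orders")
def bronze_orders():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "parquet")
            .load("/Volumes/landing/erp/orders"))

# Silver: apply_changes from the append-only bronze stream.
dlt.create_streaming_table("silver_orders")
dlt.apply_changes(
    target="silver_orders",
    source="bronze_orders",
    keys=["order_id"],
    sequence_by=F.col("loaded_at"),
    stored_as_scd_type=2,
)

# Gold: a materialized view (batch read), not another apply_changes
# target. __END_AT is the end-date column DLT adds for SCD type 2;
# NULL means the row is current.
@dlt.table(name="gold_orders_current")
def gold_orders_current():
    return dlt.read("silver_orders").where(F.col("__END_AT").isNull())
```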

For everything else, there is always Structured Streaming, CDC with foreachBatch(), etc.
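
Outside DLT, a CDC upsert with foreachBatch looks roughly like this (table names, key, and checkpoint path are placeholders):

```python
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # MERGE each micro-batch into the target Delta table (SCD type 1 style).
    target = DeltaTable.forName(spark, "main.silver.customers")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.table("main.bronze.customers_feed")
    .writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/Volumes/checkpoints/customers")
    .start())
```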

See: https://docs.databricks.com/en/delta-live-tables/cdc.html#

Regards

Toby.
https://thedatacrew.com

andreasmherman
New Contributor II

Hi Toby,

Thanks for your reply! Much appreciated and interesting to hear your view.

In terms of streaming to bronze without CDC: it would essentially be an append-only stream, right? What about if you have a dimension table that processes delta data and you want to update the state of the table incrementally? If one only appends, it could be a bit clunky, as it would be harder to query the bronze table's latest state of data (without using CDC). Hence, wouldn't it be preferable to maintain an easily accessible, complete latest state in the bronze layer too (and not an append-only bronze table)?
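
To illustrate what I mean by clunky: with an append-only bronze, every consumer has to derive the latest state itself, e.g. with a window function (the customer_id key and loaded_at timestamp columns are assumptions):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Latest state per key from an append-only bronze table; assumes a
# customer_id business key and a loaded_at ingestion timestamp.
w = Window.partitionBy("customer_id").orderBy(F.col("loaded_at").desc())

latest = (spark.read.table("main.bronze.customers")
          .withColumn("_rn", F.row_number().over(w))
          .where("_rn = 1")
          .drop("_rn"))
```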

Thinking on the gold layer:

1. If you use foreachBatch(), it would be done outside of the DLT pipeline, if I am not wrong? Essentially run as a separate Databricks job. I'd say it would be more efficient to rely on the same technology approach (DLT) throughout, if possible.

2. Materialised views are a bit limited in their support currently. E.g., they require Unity Catalog, are intended for Databricks-managed queries, don't support surrogate keys (which can be nice for reporting), and seem to be SQL-only currently.

What are your thoughts?

Best regards,

Andreas

 

Hey, 

I'm going through the same thought process as you, almost word for word. Curious what path you went down for your project? I'm trying not to have every streaming project become a huge code base, but DLT doesn't get enough of what I want done, specifically incremental processing from bronze all the way to gold.

thedatacrew
New Contributor III

Hi,

At the moment, my process is:

I am using ETL (Data Factory) to land Parquet files in a raw landing zone. I keep all the source data here, so I can fully rebuild the data if I need to.

i.e.

source_system/schema/table/loadon_year=2024/loadon_month=08/loadedon_day=01

I mount this landing location as a volume in Databricks (Unity Catalog).

1. I ingest the data into the bronze tables using Auto Loader (cloudFiles) as an append-only stream. I iterate over a list of tables & locations from metadata (sketched below). My bronze tables target my catalog and bronze schema and have a DLT pipeline set up to process them.
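
Roughly like this; the metadata list here is inlined for illustration, but in practice it comes from config:

```python
import dlt

# Illustrative metadata; in practice this is read from a config table/file.
tables = [
    {"name": "orders",    "path": "/Volumes/landing/erp/orders"},
    {"name": "customers", "path": "/Volumes/landing/erp/customers"},
]

def make_bronze_table(meta):
    # Factory function so each generated table captures its own metadata.
    @dlt.table(name=f"bronze_{meta['name']}")
    def bronze():
        return (spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "parquet")
                .load(meta["path"]))
    return bronze

for meta in tables:
    make_bronze_table(meta)
```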

2. I read the bronze tables as a streaming source for the silver layer and use APPLY CHANGES INTO another table as SCD1/SCD2, as required. This will update any existing records (SCD1) or record history (SCD2).

a. Silver layer tables target my catalog and silver schema and have a separate DLT pipeline set up to process them. At the moment you can't target a destination in a different schema within the same DLT pipeline with Unity Catalog.

3. For the gold layer, I create a live table (materialised view) which performs all the joins and field selections from silver, etc.

a. Gold layer tables target my catalog and gold schema and have a separate DLT pipeline set up to process them.

4. For surrogate keys, I use an md5 hash of the business keys and the source system id; that way I don't have to do complex key lookups. The surrogate keys are generated in the silver layer (see the sketch below).
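
Something like this (column names are illustrative):

```python
from pyspark.sql import DataFrame, functions as F

def add_surrogate_key(df: DataFrame, business_keys: list, sk_col: str) -> DataFrame:
    # md5 over the source system id plus the business keys, delimited with
    # '||' so concatenated values can't collide accidentally.
    cols = [F.col("source_system_id")] + [F.col(k) for k in business_keys]
    return df.withColumn(sk_col, F.md5(F.concat_ws("||", *cols)))

# e.g. add_surrogate_key(orders_df, ["order_id"], "order_sk")
```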

I have not hit any issues yet, although it is still early days. If you overcomplicate the processing between bronze, silver & gold, it can be a bit tricky.

I want to have a single DLT pipeline, but that would require DLT to be able to cross catalog & schema boundaries. Unless, of course, you have it all in one schema, which I don't like.

Regards

Toby.
https://thedatacrew.com
