Databricks Community

andreasmherman · ‎01-18-2024

I have a question related to when we are using #DLT. Let me try to describe the DLT problem:

Objective:

Process data end-to-end (bronze, silver gold) using DLT
Want bronze to hold a complete raw replica of the data, leveraging apply_changes SCD to write new data
Want silver to hold a complete state of the curated data, leveraging apply_changes SCD to write new data
Want gold to hold a complete state of the use case specific data, leveraging apply_changes SCD to write new data
Use auto loader to load data from landing into bronze

Problem:

Not possible to use apply_changes (e.g., SCD type 2) with DLT in several data layers (bronze, silver, gold).

If I use SCD in bronze, I cannot use it in silver or gold for instance, etc.

Reason:

skipChangeCommits needs to be used to be able to stream from a table created with apply_changes, otherwise the job crashes because it is not allowed to stream from a changed table.
skipChangeCommits ignores every form of change (delete, change, even created new rows with same keys as existing old rows)

It would be understandable if skipChangeCommits ignores SCD type 1 changes, where the row is updated itself, but not SCD type 2’s new rows where new actual rows are created

I managed to “go around this” by streaming from e.g., bronze (write with apply_changes) into a silver table with a separate DLT pipeline (i.e., one for bronze, one for silver). Then I could use ignoreChanges instead of skipChangeCommits, and stream from bronze as a normal delta table (not live table). Then both tables (bronze and silver) updates as they should, and both uses apply_changes SCD type 2.

This is good. However, it requires one to have multiple DLT pipelines for each layer and/or every time one wants an SCD table as input…
Recent Databricks documentation suggests one to use skipChangeCommits instead of ignoreChanges, which is interesting.

If one cannot use SCD type 2 in multiple layers, then all upstream tables must be “append only” stream tables, which is not too appealing. Especially when trying to build a data warehouse with the possibility to choose what data to consume from what layer, and not having to re-implement processing.

Appreciate guidance if this is not supported, or how one is supposed to incrementally load data into all tables end-to-end (bronze, silver, gold).

thedatacrew · ‎01-18-2024

Hi,

I'm no expert in DLT, and I've only been using it for a few months. My reading of docs so far suggests streaming into bronze all your raw data using an autoloader and then using apply_changes into (SCD1/2) in silver.

Finally, in the gold layer, use materialized views; as you correctly pointed out, you can't use appy_changes target as a source for another SCD1/2 target. DLT is quite opinionated in its approach, which is probably not a bad thing.

For everything else, there is always structured streaming, CDC foreachBatch() etc.

See: https://docs.databricks.com/en/delta-live-tables/cdc.html#

Regards

Toby.
https://thedatacrew.com

andreasmherman · ‎01-19-2024

Hi Toby,

Thanks for your reply! Much appreciated and interesting to hear your view.

In terms of streaming to bronze without CDC. It would essentially be an append-stream, right? What about if you have a dimension table that processes delta data and you want to update the state of the table, incrementally. If one would append-only, it could be a bit clunky, as it it would be harder to query that bronze table's latest state of data (without using CDC). Hence, wouldn't it be preferable to be able to maintain an easily accessible complete latest state of the bronze layer too (and not an append-only bronze table)?

Thinking on the gold layer:

1. If you would use foreachBatch(), it would be done outside of the DLT pipeline, if I am not wrong? Essentially run as a separate Databricks job. I think it would be more efficient to aim to rely on the same technology approach (DLT) through out if possible I'd say.

2. Materialised views are a bit limited in its support currently. E.g., requires Unity Catalog, intended for Databricks managed queries, not supporting surrogate keys which can be nice for reporting, seems to be only for SQL currently.

What are you thoughts?

Best regards,

Andreas

comfortfalcon · ‎10-15-2024

Hey,

I'm going through the same thought process as you. Almost word for word. Curious what path you went down for your project? I'm trying to not have every streaming project be a huge code base, but DLT doesn't get enough of what I want done, specifically incremental from bronze all the way to gold.

thedatacrew · ‎10-15-2024

Hi,

At the moment, my process is:-

I am using ETL (Data Factory) to land paquet files in a raw landing zone. I keep all the source data here. So I can fully rebuild the data if I need to.

i.e.

source_system/schema/table/loadon_year=2024/loadon_month=08/loadedon_day=01

I mount this landing location as a volume in databricks (Unity Catalog)

1. I ingest the data into the bronze tables using Cloudfiles as a stream append-only. I iterate over a list of tables & locations from metadata. My Bronze table targets my catalog and bronze schema and has a DLT Pipeline setup to process them.

2. I read in the bronze tables as a streaming source for the silver layer and use APPLY INTO to another table as (SCD1/SCD2) as required. This will update any existing records SCD or record history as SCD2.

a. Silver layer tables target my catalog and silver schema and have a separate DLT Pipeline setup to proess them. At the moment you can't target a destination to a different schema in the same DLT pipeline with unity catalog.

3. For the Gold layer, I create a live table (materialised view) which performs all the joins and field selections from silver etc.

a. Gold layer tables target my catalog and gold schema and have a separate DLT Pipeline setup to process them

4. For surrogate keys, I use an md5 hash of the business keys and source system id. That way I don't have to do complex key lookups. The surrogate key pipeline is generated in the Sliver layer.

I have not hit any issues yet, although it is still early days. If you overcomplicate the processing between bronze, silver & gold, it can be a bit tricky.

I want to have a single DLT pipeline, but that would require DTL to be able to cross Catalog & Schema boundaries. Unless, of course, you have it all in one schema, which I don't like.

Regards

Toby.
https://thedatacrew.com