01-18-2024 10:10 AM
I have a question related to our use of #DLT. Let me try to describe the problem:
Objective:
Problem:
Not possible to use apply_changes (e.g., SCD type 2) with DLT in several data layers (bronze, silver, gold).
If I use SCD in bronze, I cannot use it again in silver or gold, for instance.
Reason:
If one cannot use SCD type 2 in multiple layers, then all upstream tables must be "append only" streaming tables, which is not very appealing, especially when trying to build a data warehouse where consumers can choose which layer to read from without re-implementing processing.
I'd appreciate guidance on whether this is supported, or on how one is supposed to incrementally load data into all tables end-to-end (bronze, silver, gold).
01-18-2024 01:00 PM
Hi,
I'm no expert in DLT, and I've only been using it for a few months. My reading of the docs so far suggests streaming all your raw data into bronze using Auto Loader, then using apply_changes (SCD1/SCD2) in silver.
Finally, in the gold layer, use materialized views; as you correctly pointed out, you can't use an apply_changes target as the source for another SCD1/SCD2 target. DLT is quite opinionated in its approach, which is probably not a bad thing.
For everything else, there is always structured streaming, CDC with foreachBatch(), etc.
See: https://docs.databricks.com/en/delta-live-tables/cdc.html#
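Something like this is the shape of it in DLT Python. This is only a sketch; the table names, landing path, key column and ordering column are assumptions you'd swap for your own:

```python
import dlt
from pyspark.sql.functions import col

# Bronze: append-only ingest of raw files with Auto Loader (cloudFiles).
# The volume path and file format are assumptions for the example.
@dlt.table(name="customers_bronze")
def customers_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/landing/customers/")
    )

# Silver: declare the SCD2 target, then feed it with apply_changes
# from the bronze streaming table.
dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",
    source="customers_bronze",
    keys=["customer_id"],          # assumed business key
    sequence_by=col("event_ts"),   # assumed ordering column
    stored_as_scd_type=2,
)
```

Gold would then be an ordinary materialised view (a plain @dlt.table) reading the silver table as a batch source, since the apply_changes target can't feed another one.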
Regards
01-19-2024 07:06 AM
Hi Toby,
Thanks for your reply! Much appreciated and interesting to hear your view.
In terms of streaming to bronze without CDC: it would essentially be an append-only stream, right? What about a dimension table that processes delta data, where you want to update the state of the table incrementally? With append-only, it could be a bit clunky, as it would be harder to query that bronze table's latest state of data (without using CDC). Hence, wouldn't it be preferable to maintain an easily accessible, complete latest state in the bronze layer too (rather than an append-only bronze table)?
Thinking on the gold layer:
1. If you use foreachBatch(), it would be done outside of the DLT pipeline, if I am not wrong? Essentially run as a separate Databricks job (see the sketch after this list). I'd say it would be preferable to rely on the same technology (DLT) throughout if possible.
2. Materialised views are a bit limited in their support currently. E.g., they require Unity Catalog, are intended for Databricks-managed queries, don't support surrogate keys (which can be nice for reporting), and seem to be SQL-only at the moment.
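For point 1, the kind of foreachBatch() job I mean would sit outside DLT and look roughly like this. The table names, merge key and checkpoint path are made-up placeholders, and I'm assuming the silver source is append-only:

```python
from delta.tables import DeltaTable

def upsert_to_gold(batch_df, batch_id):
    # Merge each micro-batch from silver into a gold dimension table.
    # `spark` is the ambient SparkSession on Databricks.
    gold = DeltaTable.forName(spark, "main.gold.dim_customer")
    (
        gold.alias("t")
        .merge(batch_df.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    spark.readStream.table("main.silver.customers")
    .writeStream
    .foreachBatch(upsert_to_gold)
    .option("checkpointLocation", "/Volumes/main/checkpoints/dim_customer")
    .trigger(availableNow=True)
    .start()
)
```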
What are your thoughts?
Best regards,
Andreas
10-15-2024 08:20 AM
Hey,
I'm going through the same thought process as you, almost word for word. Curious what path you went down for your project? I'm trying not to have every streaming project become a huge code base, but DLT doesn't do enough of what I want, specifically incremental loads from bronze all the way to gold.
10-15-2024 08:49 AM - edited 10-15-2024 08:51 AM
Hi,
At the moment, my process is:
I am using ETL (Data Factory) to land parquet files in a raw landing zone. I keep all the source data here, so I can fully rebuild the data if I need to.
i.e.
source_system/schema/table/loadon_year=2024/loadon_month=08/loadedon_day=01
I mount this landing location as a volume in Databricks (Unity Catalog).
1. I ingest the data into the bronze tables using cloudFiles (Auto Loader) as an append-only stream. I iterate over a list of tables & locations from metadata (see the metadata-driven sketch after this list). My bronze tables target my catalog and bronze schema and have a DLT pipeline set up to process them.
2. I read the bronze tables as a streaming source for the silver layer and use APPLY CHANGES INTO another table (SCD1/SCD2) as required. This will update existing records (SCD1) or record history (SCD2).
a. Silver layer tables target my catalog and silver schema and have a separate DLT pipeline set up to process them. At the moment you can't target destinations in different schemas from the same DLT pipeline with Unity Catalog.
3. For the gold layer, I create a live table (materialised view) which performs all the joins and field selections from silver, etc.
a. Gold layer tables target my catalog and gold schema and have a separate DLT pipeline set up to process them.
4. For surrogate keys, I use an md5 hash of the business keys and source system id, so I don't have to do complex key lookups. The surrogate key is generated in the silver layer (see the hash example below).
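To illustrate step 1, the metadata-driven bronze ingestion follows roughly the pattern below. The metadata list, paths and table names are simplified stand-ins for what I actually drive from config:

```python
import dlt

# Simplified stand-in for the metadata that drives table creation.
tables = [
    {"name": "customer", "path": "/Volumes/main/landing/source_system/schema/customer/"},
    {"name": "orders",   "path": "/Volumes/main/landing/source_system/schema/orders/"},
]

def make_bronze_table(name, path):
    # Each bronze table is an append-only Auto Loader stream over its landing folder.
    @dlt.table(name=f"{name}_bronze")
    def bronze():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "parquet")
            .load(path)
        )

# Generate one bronze table definition per metadata entry.
for t in tables:
    make_bronze_table(t["name"], t["path"])
```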
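And for step 4, the surrogate key is just an md5 hash over the business keys plus the source system id, along these lines (the column names are assumptions):

```python
from pyspark.sql import functions as F

# Build a deterministic surrogate key from the business keys and the source
# system id, so no lookup against an existing key table is needed.
def with_surrogate_key(df, business_keys, source_system_col="source_system_id"):
    return df.withColumn(
        "surrogate_key",
        F.md5(F.concat_ws("||", source_system_col, *business_keys)),
    )

# Example: a customer dimension keyed on customer_number within its source system.
# customers_df = with_surrogate_key(customers_df, ["customer_number"])
```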
I have not hit any issues yet, although it is still early days. If you overcomplicate the processing between bronze, silver & gold, it can be a bit tricky.
I want to have a single DLT pipeline, but that would require DLT to be able to cross catalog & schema boundaries. Unless, of course, you have it all in one schema, which I don't like.
Regards