Delta Live Tables use case
03-18-2024 06:31 AM
Hi all,
We have the following use case and are wondering whether DLT is the correct approach.
Landing area with daily dumps of parquet files into our Data Lake container.
The daily dump does a full overwrite of the parquet each time, keeping the same file name.
The idea would be to re-process the whole parquet each time and manage the delta in the bronze table with SCD 2.
Suggestions on the best approach would be helpful.
Cheers
12-08-2024 11:33 PM
Using DLT for Your Use Case
DLT can be a good fit for your scenario, especially when implementing Slowly Changing Dimension (SCD) Type 2. Here's how you can approach this:
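- Ingestion with Auto Loader: Use Auto Loader to ingest the daily parquet files into your bronze layer. By default Auto Loader processes each file path only once, so because your dump overwrites a file with the same name, set cloudFiles.allowOverwrites to true so the replaced file is picked up again.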
- Bronze Layer Processing: Create a bronze table using DLT that reads from the landing area.
- SCD Type 2 Implementation: Implement SCD Type 2 in the silver layer using DLT's APPLY CHANGES syntax.
Implementation Approach
Here's a high-level implementation strategy:
Bronze Layer:
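import dlt
from pyspark.sql.functions import col

@dlt.table
def bronze_table():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        # re-ingest a file that was replaced under the same name
        .option("cloudFiles.allowOverwrites", "true")
        .load("/path/to/landing/area")
        # expose the file's modification time so the silver layer can sequence on it
        .select("*", col("_metadata.file_modification_time").alias("file_modification_time"))
    )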
Silver Layer with SCD Type 2:
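import dlt
from pyspark.sql.functions import col

# Target streaming table that will hold the SCD Type 2 history
dlt.create_streaming_table("silver_table_scd2")

dlt.apply_changes(
    target = "silver_table_scd2",
    source = "bronze_table",
    keys = ["your_primary_key"],
    # assumes bronze_table exposes this column, e.g. taken from _metadata.file_modification_time
    sequence_by = col("file_modification_time"),
    stored_as_scd_type = "2"
)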
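To make the SCD Type 2 outcome more concrete, here is a minimal plain-Python sketch of the bookkeeping involved when successive full dumps are folded into a history table. It is not DLT code and not how DLT is implemented; the names Version, apply_snapshot, start_at and end_at are hypothetical and only loosely mirror the start/end columns an SCD Type 2 table keeps. It assumes each dump is a full snapshot keyed by a primary key and that seq increases with every dump.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    key: str
    value: str
    start_at: int            # sequence value at which this version became current
    end_at: Optional[int]    # None while the version is still current


def apply_snapshot(history, snapshot, seq):
    """Fold one full snapshot {key: value} into an SCD Type 2 history list."""
    # Index the currently open version for each key
    current = {v.key: v for v in history if v.end_at is None}
    for key, value in snapshot.items():
        existing = current.get(key)
        if existing is not None and existing.value == value:
            continue                      # unchanged row: keep the open version
        if existing is not None:
            existing.end_at = seq         # close the superseded version
        history.append(Version(key, value, seq, None))
    return history


# Two daily dumps: key "b" changes, key "a" does not
history = []
apply_snapshot(history, {"a": "red", "b": "blue"}, seq=1)
apply_snapshot(history, {"a": "red", "b": "green"}, seq=2)
rows = [(v.key, v.value, v.start_at, v.end_at) for v in history]
```

After the second dump, "a" still has a single open version, while "b" has a closed version ("blue") and a new open one ("green"). Note that this sketch leaves a version open when its key disappears from a later dump; if deleted rows matter in your case, that needs separate handling.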