Databricks Community

AWRIGHT · ‎05-09-2024

Hi

I’m new to data modelling so could use some help.

I’m building a personal project using a fairly standard 3NF sales database as the source data.

So far I have a pipeline that incrementally extracts data from the source system each day into a Raw storage layer, using a watermarking technique. The format here is:

raw/<table name>/<current date>/export.csv

My end goal is a Star Schema (I’ve already designed this).

I’d like the next part of my pipeline to load into Silver (cleaning up column names etc), and then load into my dimensions & facts within gold layer.

However, struggling with the logic of it, and what my silver layer should look like. I’m just conscious that in my gold layer I need to retain history with SCD2.

I believe some people just upsert to Silver (update existing rows and append new rows). Then truncate/load into Gold Layer, but I think this would make it impossible to track SCD2.

Im getting a bit muddled with all of this so would appreciate some help.

koushiknpvs · ‎05-09-2024

This is what my medallion architecture looks like -

1) Bronze Layer - append raw data.

2) Silver Layer, reflect current(active) data and I do business logic transformations. The Silver layer should serve as your cleaned and transformed staging area. Here, data from the raw layer is sanitized, column names are standardized, and other transformations needed to prepare the data for analytical processing are applied. This layer often doesn't maintain historical changes—it’s more about reflecting the current, cleaned state of incoming data.

3) Gold layer - Moving to the Gold layer with a focus on dimensional modeling and implementing SCD2, your strategy needs to capture and store historical changes for analysis:

Dimension Tables with SCD2: This involves tracking the full history of data changes. Each dimension record should have metadata columns like start_date, end_date, and an is_current flag to indicate whether the record holds the current state.
Fact Tables: These typically store transactional data and should link to the dimension tables via foreign keys.

AWRIGHT · ‎05-10-2024

Thank you for the response.

From a high level it makes sense. It's the details I'm a bit unclear on.

So when a single pipeline runs (let's say today - 10/05/2024) for your architecture it might:

Extract records from source system tables that have changed since last time pipeline ran in Bronze layer. Records will either have an additional attribute added (e.g., extraction date) or be stored in a folder with the current date)
Extract only records from Bronze that have extract date = 10/05/2024 or are in the 10-05-2024 folder, cleans them up and upserts them to the silver layer (existing records overwritten with latest change and new records appended). So sliver layer is basically just a copy of the source system at this point, except it's cleaned up.

Assuming what I've said above is correct, I'm unsure how to go about loading into a Gold dimensional model.

Initial thought was to filter silver tables by extract/ingestion date of record, join relevant tables, and merge into Gold (taking into account start_date etc). However, let's say I need to join an Employees and Office table, but for 10-05-2024 only the Employees table was updated, then the join will fail as there will be no Office data to append to (due to first filtering the data by 10-05-2024). So an incremental approach doesn't really seem possible.

Is the only way to simply try merge all the Silver data into Gold?

I know these might seem like silly question, just want to make sure I'm building my pipeline correctly

AWRIGHT · ‎05-10-2024

Thank you for the response.

From a high level it makes sense. It's the details I'm a bit unclear on.

So when a single pipeline runs (let's say today - 10/05/2024) for your architecture it might:

Extract records from source system tables that have changed since last time pipeline ran in Bronze layer. Records will either have an additional attribute added (e.g., extraction date) or be stored in a folder with the current date)
Extract only records from Bronze that have extract date = 10/05/2024 or are in the 10-05-2024 folder, cleans them up and upserts them to the silver layer (existing records overwritten with latest change and new records appended). So sliver layer is basically just a copy of the source system at this point, except it's cleaned up.
Assuming what I've said above is correct, I'm unsure how to go about loading into a Gold dimensional model.

Initial thought was to filter silver tables by extract/ingestion date of record, join relevant tables, and merge into Gold (taking into account start_date etc). However, let's say I need to join an Employees and Office table, but for 10-05-2024 only the Employees table was updated, then the join will fail as there will be no Office data to append to (due to first filtering the data by 10-05-2024). So an incremental approach doesn't really seem possible.

Is the only way to simply try merge all the Silver data into Gold?

I know these might seem like silly question, just want to make sure I'm building my pipeline correctly

AWRIGHT · ‎05-10-2024

Thank you for the response.

From a high level it makes sense. It's the details I'm a bit unclear on.

So when a single pipeline runs (let's say today - 10/05/2024) for your architecture it might:

Extract records from source system tables that have changed since last time pipeline ran in Bronze layer. Records will either have an additional attribute added (e.g., extraction date) or be stored in a folder with the current date)
Extract only records from Bronze that have extract date = 10/05/2024 or are in the 10-05-2024 folder, cleans them up and upserts them to the silver layer (existing records overwritten with latest change and new records appended). So sliver layer is basically just a copy of the source system at this point, except it's cleaned up.
Assuming what I've said above is correct, I'm unsure how to go about loading into a Gold dimensional model.

Initial thought was to filter silver tables by extract/ingestion date of record, join relevant tables, and merge into Gold (taking into account start_date etc). However, let's say I need to join an Employees and Office table, but for 10-05-2024 only the Employees table was updated, then the join will fail as there will be no Office data to append to (due to first filtering the data by 10-05-2024). So an incremental approach doesn't really seem possible.

Is the only way to simply try merge all the Silver data into Gold?

I know these might seem like silly question, just want to make sure I'm building my pipeline correctly

AWRIGHT · ‎05-10-2024

Thank you.

From a high level it makes sense. It's the details I'm a bit unclear on.

So when a single pipeline runs (let's say today - 10/05/2024) for your architecture it might:

Extract records from source system tables that have changed since last time pipeline ran in Bronze layer. Records will either have an additional attribute added (e.g., extraction date) or be stored in a folder with the current date.
Extract only records from Bronze that have extract date = 10/05/2024 or are in the 10-05-2024 folder, cleans them up and upserts them to the silver layer (existing records overwritten with latest change and new records appended). So sliver layer is basically just a copy of the source system at this point, except it's cleaned up.

Assuming what I've said above is correct, I'm unsure how to go about loading into a Gold dimensional model.

Initial thought was to filter silver tables by extract/ingestion date of record, join relevant tables, and merge into Gold (taking into account start_date etc).

However, let's say I need to join an Employees and Office table, but for 10-05-2024 only the Employees table was updated, then the join will fail as there will be no Office data to append to (due to first filtering the data by 10-05-2024). So an incremental approach doesn't really seem possible.

Is the only way to simply try merge all the Silver data into Gold?