Loading Dimensions including SCD Type 2
06-15-2022 01:02 PM
I have a customer dimension, and on every incremental load I apply Type 2 or Type 1 changes to it.
This dimension is based on a silver table in my Delta Lake, where I apply a merge statement.
What happens if I need to go back and track additional Type 2 changes? Would this be possible, and how would I go about it?
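For reference, the merge is roughly of this shape (a minimal sketch; table and column names like silver.customers_staged and dim_customer are placeholders, and address stands in for whatever attribute is tracked as Type 2):

```sql
-- Minimal SCD Type 2 merge sketch; schema and names are illustrative.
MERGE INTO dim_customer AS tgt
USING (
  -- All staged rows, keyed on customer_id so they can match current rows.
  SELECT customer_id AS merge_key, customer_id, address
  FROM silver.customers_staged
  UNION ALL
  -- Changed rows again, with a NULL key so they deliberately fail the
  -- match and fall through to the INSERT branch as the new version.
  SELECT NULL AS merge_key, s.customer_id, s.address
  FROM silver.customers_staged s
  JOIN dim_customer d
    ON s.customer_id = d.customer_id AND d.is_current
  WHERE s.address <> d.address   -- address is the tracked Type 2 column
) AS src
ON tgt.customer_id = src.merge_key AND tgt.is_current
WHEN MATCHED AND tgt.address <> src.address THEN
  -- Close out the old version of a changed row.
  UPDATE SET is_current = false, end_date = current_date()
WHEN NOT MATCHED THEN
  -- Insert brand-new customers and new versions of changed ones.
  INSERT (customer_id, address, is_current, start_date, end_date)
  VALUES (src.customer_id, src.address, true, current_date(), NULL);
```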
Labels:
- ETL
- Loading
- Type Changes
06-17-2022 02:21 AM
So you need to track one or more extra columns in the SCD.
Whether you can backfill them depends on whether you still have the complete change history: if you do, you could read all those files and reconstruct the changes for the new SCD columns. But I assume you are talking about Delta Lake. Delta Lake itself does keep history, but not for very long: when a vacuum runs, the old data files get cleaned up.
So unless you have all the old data sitting somewhere, unfortunately you cannot.
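If you want to check what is still there, something like this works (the table name is a placeholder):

```sql
-- Lists the versions the transaction log still knows about;
-- silver.customers is a hypothetical table name.
DESCRIBE HISTORY silver.customers;

-- Shows table properties, including any retention overrides
-- such as delta.logRetentionDuration.
SHOW TBLPROPERTIES silver.customers;
```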
06-17-2022 02:26 AM
Thanks werners,
I was informed you could essentially recreate a Type 2 dimension from scratch, without reading the files one by one, using Delta Lake time travel. However, this doesn't seem to be the case, and the only way to build it is to incrementally load the data.
Thanks,
06-17-2022 02:31 AM
That is correct: if the Delta Lake versions are still available, you can query the table as of an older version. But that only works if the version is still there; Delta Lake does not keep all history.
By default the transaction log is retained for 30 days, and vacuum removes unreferenced data files older than 7 days, depending on your table configuration.
Something to check on your workspace. If you are not too late, you might be able to reconstruct it using time travel; see the sketch below.
https://docs.databricks.com/delta/delta-batch.html#id10
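A minimal time-travel sketch (the table name, version number, and timestamp are illustrative):

```sql
-- Read the table as of a specific version still in the transaction log:
SELECT * FROM silver.customers VERSION AS OF 12;

-- Or as of a point in time:
SELECT * FROM silver.customers TIMESTAMP AS OF '2022-06-01';
```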
You could also consider deep cloning the Delta table, but that comes with a cost (storage).
https://docs.databricks.com/spark/2.x/spark-sql/language-manual/clone.html
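A sketch of what that looks like (names are placeholders). Unlike a shallow clone, a deep clone copies the data files into its own location, so it survives vacuum runs on the source and can serve as a point-in-time snapshot:

```sql
-- Snapshot the source table as of today; names are illustrative.
CREATE TABLE silver.customers_snapshot_20220617
DEEP CLONE silver.customers;
```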

