Data Engineering
Loading Dimensions including SCDType2

BeginnerBob
New Contributor III

I have a customer dimension, and on every incremental load I apply Type 2 or Type 1 changes to it.

This dimension is based on a silver table in my delta lake, where I apply a merge statement.

What happens if I need to go back and track additional Type 2 changes? Would this be possible, and how would I go about it?
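For context, here is a minimal sketch of the SCD Type 2 logic being described: expire the current row when a tracked attribute changes and append a new current row. This is plain Python rather than a Delta Lake MERGE, and the column names (`customer_id`, `city`, `valid_from`, etc.) are illustrative, not from the original post.

```python
from datetime import date

def apply_scd2(dimension, incoming, today):
    """Apply SCD Type 2: close the current row for a changed key and
    append a new current row. Both inputs are lists of dicts; column
    names are illustrative."""
    current = {r["customer_id"]: r for r in dimension if r["is_current"]}
    for row in incoming:
        old = current.get(row["customer_id"])
        if old is None:
            # New customer: insert as the current row.
            dimension.append({**row, "valid_from": today,
                              "valid_to": None, "is_current": True})
        elif any(old[c] != row[c] for c in row if c != "customer_id"):
            # A tracked attribute changed: expire the old row, add a new one.
            old["valid_to"] = today
            old["is_current"] = False
            dimension.append({**row, "valid_from": today,
                              "valid_to": None, "is_current": True})
    return dimension

dim = [{"customer_id": 1, "city": "Oslo",
        "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True}]
inc = [{"customer_id": 1, "city": "Bergen"},
       {"customer_id": 2, "city": "Tromso"}]
dim = apply_scd2(dim, inc, date(2021, 6, 1))
```

In a Delta Lake pipeline the same update/insert pair is typically expressed as a single MERGE against the silver table.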

4 REPLIES

-werners-
Esteemed Contributor III

So you need one or more extra columns in the SCD.

That depends: if you still have the complete change history (the raw incoming files), you could read those files and reconstruct all the changes for the new SCD columns. But I assume you are talking about delta lake. Delta lake itself does keep history, but not for very long: when a vacuum runs, old versions get cleaned up.

So unless you have all the old data sitting somewhere, you cannot unfortunately.

BeginnerBob
New Contributor III

Thanks werners,

I was informed you could essentially recreate a type 2 dimension from scratch, without reading the files one by one, using delta lake time travel. However, this doesn't seem to be the case, and the only way to create it is to incrementally load the data.

Thanks,

-werners-
Esteemed Contributor III

That is correct: if the delta lake versions are still available, you can query the table as of an older version. But only if that version is still there; delta lake does not keep all history.

Often the retention is 7 or 30 days, depending on your configuration.

Something to check in your workspace. If you are not too late, you might be able to reconstruct it using time travel.

https://docs.databricks.com/delta/delta-batch.html#id10
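If the versions are still available, the reconstruction could look roughly like this: read each historical version of the source table in order and diff consecutive snapshots to produce the Type 2 rows. This sketch uses plain Python stand-ins for the Delta reads (on Databricks each snapshot would come from something like `spark.read.option("versionAsOf", v).table(...)`); keys and columns are hypothetical.

```python
def rebuild_scd2(snapshots):
    """Rebuild an SCD Type 2 dimension from historical snapshots of a
    source table, as Delta time travel would provide them. Each snapshot
    is (version, {key: attributes}); names are illustrative."""
    dimension = []
    current = {}  # key -> index of that key's current row in `dimension`
    for version, rows in snapshots:
        for key, attrs in rows.items():
            if key not in current:
                # First time we see this key: open a row.
                dimension.append({"key": key, **attrs,
                                  "from_version": version, "to_version": None})
                current[key] = len(dimension) - 1
            elif {k: dimension[current[key]][k] for k in attrs} != attrs:
                # Attributes changed between versions: close and reopen.
                dimension[current[key]]["to_version"] = version
                dimension.append({"key": key, **attrs,
                                  "from_version": version, "to_version": None})
                current[key] = len(dimension) - 1
    return dimension

snapshots = [
    (0, {1: {"city": "Oslo"}}),
    (1, {1: {"city": "Oslo"}, 2: {"city": "Bergen"}}),
    (2, {1: {"city": "Stavanger"}, 2: {"city": "Bergen"}}),
]
history = rebuild_scd2(snapshots)
```

Note this only sees changes at the granularity of the surviving versions; anything vacuumed away in between is lost, which is the limitation discussed above.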

You could consider deep cloning the delta table. But this comes with a cost (storage).

https://docs.databricks.com/spark/2.x/spark-sql/language-manual/clone.html
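A deep clone archive would be issued as SQL from a notebook; this tiny helper just builds the statement (table names are examples, and in a notebook you would pass the result to `spark.sql`):

```python
def deep_clone_sql(source: str, target: str) -> str:
    # Databricks SQL for a deep clone: a full, independent copy of the
    # source table's data and metadata at clone time. Names are examples.
    return f"CREATE TABLE IF NOT EXISTS {target} DEEP CLONE {source}"

stmt = deep_clone_sql("silver.customer", "archive.customer_snapshot")
# In a notebook: spark.sql(stmt)
```

Scheduling such a clone periodically gives you durable snapshots that survive VACUUM, at the price of storing full copies.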

Kaniz
Community Manager

Hi @Lloyd Vickery​, we haven't heard from you since the last response from @Werner Stinckens​, and I was checking back to see if his suggestions helped. If you have found a solution, please share it with the community, as it may help others.
