Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Loading Dimensions including SCD Type 2

BeginnerBob
New Contributor III

I have a customer dimension, and on every incremental load I apply Type 2 or Type 1 changes to it.

The dimension is built from a silver table in my Delta Lake, where I apply a MERGE statement.

What happens if I need to go back and track additional Type 2 changes? Would this be possible? How would I go about it?
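For context, the per-batch SCD Type 2 logic being described can be sketched in plain Python (this is an illustration of the pattern, not the actual Delta MERGE; the column names `customer_id`, `name`, `effective_from`, `effective_to`, and `is_current` are assumptions):

```python
def apply_scd2(dimension, updates, batch_date):
    """Apply one incremental batch to an SCD Type 2 dimension.

    dimension: list of row dicts with keys customer_id, name,
               effective_from, effective_to (None = open), is_current
    updates:   list of dicts with keys customer_id, name
    """
    # Index the currently-open row for each business key
    by_key = {r["customer_id"]: r for r in dimension if r["is_current"]}
    for u in updates:
        current = by_key.get(u["customer_id"])
        if current is not None and current["name"] == u["name"]:
            continue  # no change: nothing to do for this key
        if current is not None:
            # Type 2: close out the existing current row
            current["effective_to"] = batch_date
            current["is_current"] = False
        # Insert the new current version of the row
        dimension.append({
            "customer_id": u["customer_id"],
            "name": u["name"],
            "effective_from": batch_date,
            "effective_to": None,
            "is_current": True,
        })
    return dimension
```

Running two batches through this produces one closed row and one open row per changed key, which is the shape a Delta MERGE with a matched-update plus not-matched-insert branch would maintain.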

3 REPLIES

-werners-
Esteemed Contributor III

So you need one or more extra columns in the SCD.

Whether you can backfill them depends: if you still have the complete change history, you could read all those files and reconstruct the changes for the new SCD columns. But I assume you are talking about Delta Lake. Delta Lake itself does keep history, but not for very long: when a VACUUM runs, the old file versions get cleaned up.

So unless you have all the old data sitting somewhere, unfortunately you cannot.

BeginnerBob
New Contributor III

Thanks werners,

I was informed you could essentially recreate a Type 2 dimension from scratch, without reading the files one by one, using Delta Lake time travel. However, that doesn't seem to be the case, and the only way to build it is to load the data incrementally.

Thanks,

-werners-
Esteemed Contributor III

That is correct: if the Delta Lake versions are still available, you can query the table as of an older version. But only if that version is still there; Delta Lake does not keep history forever.

Often it is 7 or 30 days, depending on your configuration.

That is something to check in your workspace. If you are not too late, you might be able to reconstruct it using time travel.
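The retention windows are controlled by table properties, which you can inspect and extend before relying on time travel. A hedged sketch (the table name `customers` is illustrative; the defaults shown are the documented Delta defaults):

```sql
-- Inspect the current retention settings
SHOW TBLPROPERTIES customers;

-- Extend retention so older versions survive longer
ALTER TABLE customers SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 60 days',        -- default: 30 days
  'delta.deletedFileRetentionDuration' = 'interval 60 days' -- default: 7 days
);
```

Note that extending `deletedFileRetentionDuration` keeps vacuumed files around longer, so it trades storage for time-travel reach.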

https://docs.databricks.com/delta/delta-batch.html#id10
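If the versions are still available, the reconstruction werners describes amounts to walking the snapshots in order and opening/closing history rows whenever a value changes. A minimal pure-Python sketch of that idea, assuming each time-travel version has already been read into a simple key-to-value mapping (in Spark you would obtain each snapshot with `versionAsOf`; the names here are illustrative):

```python
def rebuild_scd2(snapshots):
    """Rebuild an SCD Type 2 history from ordered table snapshots.

    snapshots: list of (version, {customer_id: attribute_value}) pairs,
               oldest first -- e.g. one per Delta time-travel version.
    Returns history rows whose effective_from/effective_to are versions.
    """
    history = []
    current = {}  # customer_id -> the open history row for that key
    for version, snap in snapshots:
        for key, value in snap.items():
            row = current.get(key)
            if row is not None and row["value"] == value:
                continue  # value unchanged in this version
            if row is not None:
                row["effective_to"] = version  # close the old version
            new_row = {"customer_id": key, "value": value,
                       "effective_from": version, "effective_to": None}
            history.append(new_row)
            current[key] = new_row
    return history
```

This only works back to the oldest version that has not yet been vacuumed, which is exactly the limitation discussed above.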

You could consider deep cloning the Delta table, but this comes with a storage cost.

https://docs.databricks.com/spark/2.x/spark-sql/language-manual/clone.html
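A deep clone is a one-statement way to keep an independent copy whose history will not be vacuumed along with the source. A sketch using the documented syntax (the table names are illustrative):

```sql
-- One-time snapshot of the dimension's current state
CREATE TABLE customers_backup DEEP CLONE customers;

-- Re-run later to refresh the clone incrementally
CREATE OR REPLACE TABLE customers_backup DEEP CLONE customers;
```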
