Data Engineering
Loading Dimensions including SCDType2

BeginnerBob
New Contributor III

I have a customer dimension, and on every incremental load I apply Type 2 or Type 1 changes to it.

This dimension is based on a silver table in my delta lake, where I apply a merge statement.

What happens if I need to go back and track additional Type 2 changes? Would this be possible, and how would I go about it?
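For context, here is a minimal sketch of the SCD Type 2 logic being described: expire the current row when a tracked attribute changes and append a new current row. This is plain Python rather than a Delta Lake MERGE, and the column names (`customer_id`, `city`, `valid_from`, etc.) are illustrative, not from the original post.

```python
from datetime import date

def apply_scd2(dimension, incoming, today):
    """Apply SCD Type 2: close the current row for a changed key and
    append a new current row. Both inputs are lists of dicts; column
    names are illustrative."""
    current = {r["customer_id"]: r for r in dimension if r["is_current"]}
    for row in incoming:
        old = current.get(row["customer_id"])
        if old is None:
            # New customer: insert as the current row.
            dimension.append({**row, "valid_from": today,
                              "valid_to": None, "is_current": True})
        elif any(old[c] != row[c] for c in row if c != "customer_id"):
            # A tracked attribute changed: expire the old row, add a new one.
            old["valid_to"] = today
            old["is_current"] = False
            dimension.append({**row, "valid_from": today,
                              "valid_to": None, "is_current": True})
    return dimension

dim = [{"customer_id": 1, "city": "Oslo",
        "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True}]
inc = [{"customer_id": 1, "city": "Bergen"},
       {"customer_id": 2, "city": "Tromso"}]
dim = apply_scd2(dim, inc, date(2021, 6, 1))
```

In a Delta Lake pipeline the same update/insert pair is typically expressed as a single MERGE against the silver table.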

4 REPLIES

-werners-
Esteemed Contributor III

So you need one or more extra columns in the SCD.

That depends: if you still have the complete change history (the raw incoming files), you could read those files and reconstruct all the changes for the new SCD columns. But I assume you are talking about delta lake. Delta lake itself does keep history, but not for very long: when a vacuum runs, old versions get cleaned up.

So unless you have all the old data sitting somewhere, you cannot unfortunately.

BeginnerBob
New Contributor III

Thanks werners,

I was informed you could essentially recreate a type 2 dimension from scratch, without reading the files one by one, using delta lake time travel. However, this doesn't seem to be the case, and the only way to create it is to incrementally load the data.

Thanks,

-werners-
Esteemed Contributor III

That is correct: if the delta lake versions are still available, you can query the table as of an older version. But only if that version is still there; delta lake does not keep all history.

Often the retention is 7 or 30 days, depending on your configuration.

Something to check in your workspace. If you are not too late, you might be able to reconstruct it using time travel.

https://docs.databricks.com/delta/delta-batch.html#id10
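If the versions are still available, the reconstruction could look roughly like this: read each historical version of the source table in order and diff consecutive snapshots to produce the Type 2 rows. This sketch uses plain Python stand-ins for the Delta reads (on Databricks each snapshot would come from something like `spark.read.option("versionAsOf", v).table(...)`); keys and columns are hypothetical.

```python
def rebuild_scd2(snapshots):
    """Rebuild an SCD Type 2 dimension from historical snapshots of a
    source table, as Delta time travel would provide them. Each snapshot
    is (version, {key: attributes}); names are illustrative."""
    dimension = []
    current = {}  # key -> index of that key's current row in `dimension`
    for version, rows in snapshots:
        for key, attrs in rows.items():
            if key not in current:
                # First time we see this key: open a row.
                dimension.append({"key": key, **attrs,
                                  "from_version": version, "to_version": None})
                current[key] = len(dimension) - 1
            elif {k: dimension[current[key]][k] for k in attrs} != attrs:
                # Attributes changed between versions: close and reopen.
                dimension[current[key]]["to_version"] = version
                dimension.append({"key": key, **attrs,
                                  "from_version": version, "to_version": None})
                current[key] = len(dimension) - 1
    return dimension

snapshots = [
    (0, {1: {"city": "Oslo"}}),
    (1, {1: {"city": "Oslo"}, 2: {"city": "Bergen"}}),
    (2, {1: {"city": "Stavanger"}, 2: {"city": "Bergen"}}),
]
history = rebuild_scd2(snapshots)
```

Note this only sees changes at the granularity of the surviving versions; anything vacuumed away in between is lost, which is the limitation discussed above.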

You could consider deep cloning the delta table. But this comes with a cost (storage).

https://docs.databricks.com/spark/2.x/spark-sql/language-manual/clone.html
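A deep clone archive would be issued as SQL from a notebook; this tiny helper just builds the statement (table names are examples, and in a notebook you would pass the result to `spark.sql`):

```python
def deep_clone_sql(source: str, target: str) -> str:
    # Databricks SQL for a deep clone: a full, independent copy of the
    # source table's data and metadata at clone time. Names are examples.
    return f"CREATE TABLE IF NOT EXISTS {target} DEEP CLONE {source}"

stmt = deep_clone_sql("silver.customer", "archive.customer_snapshot")
# In a notebook: spark.sql(stmt)
```

Scheduling such a clone periodically gives you durable snapshots that survive VACUUM, at the price of storing full copies.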

Kaniz
Community Manager

Hi @Lloyd Vickery​, we haven't heard from you since the last response from @Werner Stinckens​, and I was checking back to see if his suggestions helped. If you have found a solution, please share it with the community, as it may help others.
