cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Handling data close to SCD2 with Delta tables

RabahO
New Contributor III

Hello, 

stack used: pyspark and delta tables

I'm working with some data that look a bit like SCD2 data.

Basically, the data has columns that represent an id, a rank column and other informations, here's an example:

login, email, business_timestamp => these fields are the "primary" key

received_date => field to sort on

account_points => other

For received_date 1:

loginemailbusiness_timestampreceived_dateaccount_points
aaaaaa@mail.com2024-01-01T00:002024-01-0110
bbbbbb@mail.com2024-01-01T00:002024-01-015
aaaaaa@mail.com2024-01-01T10:002024-01-017

 

The data may or may not change in the future, and in the case it does change it'll be duplicated on the primary key fields (login, email, business_timestamp) and to find the one we want to keep we'll sort on the received_date (the latest would be the one to keep).

For received_date 2: 

loginemailbusiness_timestampreceived_dateaccount_points
aaaaaa@mail.com2024-01-01T00:002024-01-0215

Expected result: 

 

historic view (from which we can rebuild any "best view" for a date with an sql query)

is the result of an naive append operation on the delta table everytime the ingestion is ran

loginemailbusiness_timestampreceived_dateaccount_points
aaaaaa@mail.com2024-01-01T00:002024-01-0110
bbbbbb@mail.com2024-01-01T00:002024-01-015
aaaaaa@mail.com2024-01-01T00:002024-01-0215
aaaaaa@mail.com2024-01-01T10:002024-01-017

 

best view (latest view for all business use)

is the result of a window rank function on the keys and sorting on the received date, it uses the historic view to create it.

loginemailbusiness_timestampreceived_dateaccount_points
aaaaaa@mail.com2024-01-01T00:002024-01-0215
bbbbbb@mail.com2024-01-01T00:002024-01-015
aaaaaa@mail.com2024-01-01T10:002024-01-017

 

My question is: how would you design such a system to answer two problematics:

  1. have a best view that is the singular data at the latest received_date
  2. keep a historic of the evolutions to rebuild any "best view" at any given received date <> filter

I found that time travel won't really work because we can't customize the fields on which to travel (id and timestamp aren't practical). 

I also didn't find a real "upsert" operation using pyspark.

Thank you.

1 REPLY 1

Wojciech_BUK
Contributor III

Your problem is exactly like SCD2 . You just add one more column with valid to date ( optionals you can add flag is actual to tag current records)

You can use DLT apply changes syntax. Alternatively Merge statement .

On the top of that table you can build views or function to queryy table at given timestamp.

Current state view will be with WHERE valit_to is null.

To query history at given state you just do 'your date' is between valid_from and valod_to.