Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Handling data close to SCD2 with Delta tables

RabahO
New Contributor III

Hello, 

stack used: pyspark and delta tables

I'm working with some data that looks a bit like SCD2 data.

Basically, the data has columns that form an id, a column to rank on, and other information. Here's an example:

login, email, business_timestamp => these fields are the "primary" key

received_date => field to sort on

account_points => other information

For received_date 1:

login  email         business_timestamp  received_date  account_points
aaa    aaa@mail.com  2024-01-01T00:00    2024-01-01     10
bbb    bbb@mail.com  2024-01-01T00:00    2024-01-01     5
aaa    aaa@mail.com  2024-01-01T10:00    2024-01-01     7

 

The data may or may not change in the future. When it does change, new rows arrive that duplicate the primary-key fields (login, email, business_timestamp), and the row to keep is found by sorting on received_date (the latest one wins).

For received_date 2: 

login  email         business_timestamp  received_date  account_points
aaa    aaa@mail.com  2024-01-01T00:00    2024-01-02     15

Expected result:

historic view (from which we can rebuild any "best view" for a given date with a SQL query):

the result of a naive append to the Delta table every time the ingestion runs.

login  email         business_timestamp  received_date  account_points
aaa    aaa@mail.com  2024-01-01T00:00    2024-01-01     10
bbb    bbb@mail.com  2024-01-01T00:00    2024-01-01     5
aaa    aaa@mail.com  2024-01-01T00:00    2024-01-02     15
aaa    aaa@mail.com  2024-01-01T10:00    2024-01-01     7

 

best view (latest view for all business uses):

the result of a window rank function over the key columns, ordered by received_date; it is built from the historic view.

login  email         business_timestamp  received_date  account_points
aaa    aaa@mail.com  2024-01-01T00:00    2024-01-02     15
bbb    bbb@mail.com  2024-01-01T00:00    2024-01-01     5
aaa    aaa@mail.com  2024-01-01T10:00    2024-01-01     7

 

My question is: how would you design such a system to meet two requirements:

  1. have a best view that holds the single latest row per key (latest received_date)
  2. keep a history of the evolutions so any "best view" can be rebuilt for a given received_date filter

I found that time travel won't really work, because you can't customize the fields to travel on (version id and commit timestamp aren't practical here).

I also didn't find a real "upsert" operation using PySpark.

Thank you.

1 REPLY

Wojciech_BUK
Valued Contributor III

Your problem is exactly SCD2. You just add one more column with a valid_to date (optionally also a flag such as is_current to tag current records).

You can use the DLT APPLY CHANGES syntax, or alternatively a MERGE statement.

On top of that table you can build views or a function to query the table at a given timestamp.

The current-state view is simply WHERE valid_to IS NULL.

To query history at a given state you just filter where 'your date' is between valid_from and valid_to.
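The valid_from / valid_to bookkeeping described above can be sketched in plain Python (illustrative only, not Spark code; the function names are made up) to show the close-out and point-in-time lookup logic:

```python
# SCD2 bookkeeping sketch: each row carries valid_from / valid_to.
# valid_to is None while the row is current (the WHERE valid_to IS NULL case).

def upsert(table, key, payload, received_date):
    """Close the current row for `key` (if any) and append the new version."""
    for row in table:
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = received_date  # close out the old version
    table.append({"key": key, "payload": payload,
                  "valid_from": received_date, "valid_to": None})

def as_of(table, key, date):
    """Point-in-time lookup: `date` falls between valid_from and valid_to."""
    for row in table:
        if (row["key"] == key and row["valid_from"] <= date
                and (row["valid_to"] is None or date < row["valid_to"])):
            return row["payload"]
    return None

table = []
key = ("aaa", "aaa@mail.com", "2024-01-01T00:00")
upsert(table, key, 10, "2024-01-01")
upsert(table, key, 15, "2024-01-02")
```

One design choice to settle in the real MERGE: the intervals here are half-open (valid_from inclusive, valid_to exclusive), so each date matches exactly one version per key.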

 
