Hello,
Stack used: PySpark and Delta tables.
I'm working with some data that looks a bit like SCD2 data.
Basically, the rows have columns that form an id, a column to sort on, and other information. Here's an example:
- `login`, `email`, `business_timestamp` => these fields are the "primary" key
- `received_date` => field to sort on
- `account_points` => other data
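
To make the column roles concrete, here's an illustrative schema sketch (the exact Spark types are just my guess from the sample rows):

```python
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, DateType, IntegerType)

# Illustrative only: types inferred from the sample rows below
schema = StructType([
    StructField("login", StringType()),                  # primary key part
    StructField("email", StringType()),                  # primary key part
    StructField("business_timestamp", TimestampType()),  # primary key part
    StructField("received_date", DateType()),            # sort field, latest wins
    StructField("account_points", IntegerType()),        # payload
])
```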
For the first received_date (2024-01-01):
| login | email | business_timestamp | received_date | account_points |
|-------|-------|--------------------|---------------|----------------|
| aaa | aaa@mail.com | 2024-01-01T00:00 | 2024-01-01 | 10 |
| bbb | bbb@mail.com | 2024-01-01T00:00 | 2024-01-01 | 5 |
| aaa | aaa@mail.com | 2024-01-01T10:00 | 2024-01-01 | 7 |
The data may or may not change in the future. When it does change, the new rows are duplicates on the primary key fields (login, email, business_timestamp), and the one to keep is found by sorting on received_date: the latest row wins.
For the second received_date (2024-01-02):
| login | email | business_timestamp | received_date | account_points |
|-------|-------|--------------------|---------------|----------------|
| aaa | aaa@mail.com | 2024-01-01T00:00 | 2024-01-02 | 15 |
Expected result:
Historic view (from which any "best view" for a given date can be rebuilt with an SQL query): this is the result of a naive append on the Delta table every time the ingestion is run (sketch after the table).
| login | email | business_timestamp | received_date | account_points |
|-------|-------|--------------------|---------------|----------------|
| aaa | aaa@mail.com | 2024-01-01T00:00 | 2024-01-01 | 10 |
| bbb | bbb@mail.com | 2024-01-01T00:00 | 2024-01-01 | 5 |
| aaa | aaa@mail.com | 2024-01-01T00:00 | 2024-01-02 | 15 |
| aaa | aaa@mail.com | 2024-01-01T10:00 | 2024-01-01 | 7 |
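
In code, the ingestion would be nothing more than a sketch like this (`new_batch_df` and the `history` table name are placeholders):

```python
# Naive append: every ingestion run just adds the new batch
# to the historic Delta table, duplicates included
(new_batch_df.write
    .format("delta")
    .mode("append")
    .saveAsTable("history"))
```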
Best view (the latest view, for all business use): this is built from the historic view with a window rank function partitioned on the key fields and ordered by received_date (sketch after the table).
| login | email | business_timestamp | received_date | account_points |
|-------|-------|--------------------|---------------|----------------|
| aaa | aaa@mail.com | 2024-01-01T00:00 | 2024-01-02 | 15 |
| bbb | bbb@mail.com | 2024-01-01T00:00 | 2024-01-01 | 5 |
| aaa | aaa@mail.com | 2024-01-01T10:00 | 2024-01-01 | 7 |
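
A sketch of that dedup in PySpark (same placeholder `history` table, `spark` is the active session):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One row per (login, email, business_timestamp): the latest received_date.
# row_number instead of rank guarantees a single survivor even on ties.
w = (Window
     .partitionBy("login", "email", "business_timestamp")
     .orderBy(F.col("received_date").desc()))

best_view_df = (spark.table("history")
                .withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn"))
```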
My question is: how would you design such a system to meet two requirements:
- have a best view that holds exactly one row per key, the one at the latest received_date
- keep a history of the evolutions so that the "best view" can be rebuilt as of any given received_date with a filter (see the sketch after this list)
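
For the second point, my current idea is a hypothetical helper along these lines (same placeholders as above), which just filters the history before applying the dedup window:

```python
from pyspark.sql import DataFrame, functions as F
from pyspark.sql.window import Window

def best_view_as_of(as_of_date: str) -> DataFrame:
    """Rebuild the best view as it stood at a given received_date."""
    w = (Window
         .partitionBy("login", "email", "business_timestamp")
         .orderBy(F.col("received_date").desc()))
    return (spark.table("history")
            # Only keep what had been received up to that date...
            .filter(F.col("received_date") <= F.lit(as_of_date))
            # ...then apply the same dedup window as for the latest view
            .withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))
```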
I found that time travel won't really work here, because the field to travel on can't be customized (version id and commit timestamp aren't practical for this).
I also didn't find a real "upsert" operation in PySpark.
Thank you.