Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Handling data close to SCD2 with Delta tables

RabahO
New Contributor III

Hello, 

stack used: pyspark and delta tables

I'm working with some data that looks a bit like SCD2 data.

Basically, the data has columns that form an id, a column to rank on, and other information. Here's an example:

login, email, business_timestamp => these fields are the "primary" key

received_date => field to sort on

account_points => other information

For received_date 1:

login  email         business_timestamp  received_date  account_points
aaa    aaa@mail.com  2024-01-01T00:00    2024-01-01     10
bbb    bbb@mail.com  2024-01-01T00:00    2024-01-01     5
aaa    aaa@mail.com  2024-01-01T10:00    2024-01-01     7

 

The data may or may not change in the future. When it does change, new rows arrive that duplicate the primary-key fields (login, email, business_timestamp), and the row to keep is found by sorting on received_date (the latest one wins).

For received_date 2: 

login  email         business_timestamp  received_date  account_points
aaa    aaa@mail.com  2024-01-01T00:00    2024-01-02     15

Expected result:

historic view (from which we can rebuild any "best view" for a given date with a SQL query):

the result of a naive append to the Delta table every time the ingestion runs.

login  email         business_timestamp  received_date  account_points
aaa    aaa@mail.com  2024-01-01T00:00    2024-01-01     10
bbb    bbb@mail.com  2024-01-01T00:00    2024-01-01     5
aaa    aaa@mail.com  2024-01-01T00:00    2024-01-02     15
aaa    aaa@mail.com  2024-01-01T10:00    2024-01-01     7

 

best view (latest view for all business uses):

the result of a window rank function over the key columns, ordered by received_date; it is built from the historic view.

login  email         business_timestamp  received_date  account_points
aaa    aaa@mail.com  2024-01-01T00:00    2024-01-02     15
bbb    bbb@mail.com  2024-01-01T00:00    2024-01-01     5
aaa    aaa@mail.com  2024-01-01T10:00    2024-01-01     7

 

My question is: how would you design such a system to meet two requirements:

  1. have a best view that holds the single latest row per key (latest received_date)
  2. keep a history of the evolutions so any "best view" can be rebuilt for a given received_date filter

I found that time travel won't really work, because you can't customize the fields to travel on (version id and commit timestamp aren't practical here).

I also didn't find a real "upsert" operation using PySpark.

Thank you.

1 REPLY

Wojciech_BUK
Valued Contributor III

Your problem is exactly SCD2. You just add one more column with a valid_to date (optionally also a flag such as is_current to tag current records).

You can use the DLT APPLY CHANGES syntax, or alternatively a MERGE statement.

On top of that table you can build views or a function to query the table at a given timestamp.

The current-state view is simply WHERE valid_to IS NULL.

To query history at a given state you just filter where 'your date' is between valid_from and valid_to.
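The valid_from / valid_to bookkeeping described above can be sketched in plain Python (illustrative only, not Spark code; the function names are made up) to show the close-out and point-in-time lookup logic:

```python
# SCD2 bookkeeping sketch: each row carries valid_from / valid_to.
# valid_to is None while the row is current (the WHERE valid_to IS NULL case).

def upsert(table, key, payload, received_date):
    """Close the current row for `key` (if any) and append the new version."""
    for row in table:
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = received_date  # close out the old version
    table.append({"key": key, "payload": payload,
                  "valid_from": received_date, "valid_to": None})

def as_of(table, key, date):
    """Point-in-time lookup: `date` falls between valid_from and valid_to."""
    for row in table:
        if (row["key"] == key and row["valid_from"] <= date
                and (row["valid_to"] is None or date < row["valid_to"])):
            return row["payload"]
    return None

table = []
key = ("aaa", "aaa@mail.com", "2024-01-01T00:00")
upsert(table, key, 10, "2024-01-01")
upsert(table, key, 15, "2024-01-02")
```

One design choice to settle in the real MERGE: the intervals here are half-open (valid_from inclusive, valid_to exclusive), so each date matches exactly one version per key.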

 
