Incremental load from two tables

garciargs · ‎01-17-2025

Hi,

I am looking to build a ETL process for a incremental load silver table.

This silver table, lets say "contracts_silver", is built by joining two bronze tables, "contracts_raw" and "customer".

contracts_silver

CONTRACT_ID	STATUS	CUSTOMER_NAME
1	SIGNED	Peter Smith
2	SIGNED	John Smith

contracts_raw

ID	STATUS	DATE	CUSTOMER_ID
1	SIGNED	2025-01-15	1
2	SIGNED	2025-01-15	2

customer_raw

ID	NAME	DOB
1	Peter Smith	2025-01-15
2	John Smith	2025-01-15

The "contracts_raw" table will grow faster than "customer_raw".

Updates are supposed to be in batch. In any update, I can have inserts, updates or deletes on both raw tables.

Considering that at any given time I can have updates happening on both raw tables, or only one of then, is databricks capable auomatically detecting the need to update the "contracts_silver" table if:

Update 1: in contracts_raw, ID 1 is changed to STATUS cancelled
Update 2: in customer_raw, ID2, name is changed to John J. Smith

The goal is to always reprocess, but only rows that will eventually change in the silver table.