Databricks Community

_TJ · ‎02-08-2024

Small introduction; I'm a BI/Data developer, mostly working in MSSQL & DataFactory, coming from SSIS. Now I'm trying Databricks, to see if it works for me and my customers. I got enthusiastic by the video; https://www.youtube.com/watch?v=PIFL7W3DmaY&t=4652s, now I'm on the waiting list for streaming tables and materialized views. My apologies for probably some newbe questions, but sometimes it's hard to find the right works to search:)

I'm building a simplified PoC, with Sales Order Lines for an ERP system.

I'm loading parquet files in Azure ADLS, structure:

- Source system
-->Table (Orderlines)

---->Delta

------>2024-02-02

------>2024-02-03

---->Full

------>2024-02-01

Mostly once a week a full load, daily incremental.

I was planning to use a STREAMING TABLE with APPLY_CHANGES to load the data. Now I'm wondering how to configure the 'merge/upsert' part, in the following two scenarios; we can do an incremental load from the source system in two ways: by date (fetch all order lines from the last 5 days) or by ID (fetch all order lines for the order IDs that have been changed the last 5 days). The problem lies in how to handle deletes in the source system.

For the first case, it must delete all records after the given dates, that are not presented in the incremental load

For the second case, it must delete all records from the order IDs that are loaded in the incremental load but do not exist anymore.

Currently, in SQL we do this, with a bit of pseudo-code, but I hope you get the idea.

Case 1 - By date:

SELECT @date = MIN(Date) FROM Source

DELETE FROM Target WHERE Date > @date

INSERT INTO Target SELECT * FROM Source

Case 2 = By id

DELETE FROM Target WHERE ID IN (SELECT ID from Source)

INSERT INTO Target SELECT * FROM Source

Target stays up-to-date, and records that have been removed are deleted.

Is this possible in DLT? I can't find it in the documentation, only that the keys must be unique; which is not the case.

I'm hoping that I'm missing something crucial since it's a fairly common issue I believe: incremental loading from a source system where - sometimes - rows will be deleted. Can anyone point me in the right direction?

_TJ · ‎02-14-2024

Hi @Retired_mod

First of all, thanks for your comprehansive answer. Although I appreciate your input, I'm missing the answer to address my problem: I don't want to delete all data from the previous load, only the data that 'overlaps' my new load;

Case 1
In this case, I load data from the source system using a 'sliding' date window: I'm fetching all rows from (let say) the last month, because rows can change or removed in this month. So I want to keep everything allready loaded prior to the last month and delete everything from last month and further (because these are in my new load).

Case 2
Same for this case: I don't want to delete the older ID's that no longer exists: They do exist, but they are not in my icremental data load anymore: I'm only fetching the order line for the ID's that are changed. So it must keep all ID's from prior load, and delete all IDs that are in the new incremental load.

How can DLT handles these deletes? I can't see how a synthetic key could help in these cases.

Do you understand my problem? Otherwise, I will prepare some samples.