02-08-2024 03:02 AM
- Source system --> Table (Orderlines)
02-14-2024 01:54 AM
Hi @_TJ, thank you for sharing your background and your interest in Databricks! Let's dive into your scenarios and explore how you can achieve similar functionality using Databricks Delta Live Tables (DLT).
Streaming Tables and Materialized Views:
Handling Incremental Loads and Deletes:
Case 1: Incremental Load by Date
To achieve this, you can create a streaming table that ingests data from your source system based on the date. You'll then use the APPLY CHANGES INTO statement to merge the changes into your target table.
When new data arrives, DLT will automatically apply the changes (inserts, updates, and deletes) to your target table. Records with dates older than the given date will be deleted from the target table.
Here’s a high-level example of how you might set this up:
-- Ingest the Parquet files as a streaming table
CREATE OR REFRESH STREAMING TABLE my_streaming_table
AS SELECT * FROM STREAM read_files(
  '/path/to/parquet/files',
  format => 'parquet'
);

-- Declare the target, then merge the changes into it
CREATE OR REFRESH STREAMING TABLE my_target_table;

APPLY CHANGES INTO my_target_table
FROM STREAM(my_streaming_table)
KEYS (order_id)          -- unique key; column name assumed
SEQUENCE BY order_date;  -- ordering column; column name assumed
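Note that APPLY CHANGES removes a target row only when the feed delivers an explicit delete event for it; re-loading the window by itself does not delete anything. If your extract carries a change-type flag, you can map deletes like this (a sketch; the operation column is an assumption about your feed, not a DLT built-in):
-- Sketch: mapping flagged deletes to target deletes
APPLY CHANGES INTO my_target_table
FROM STREAM(my_streaming_table)
KEYS (order_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY order_date;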
Case 2: Incremental Load by ID
For this scenario, you'll follow a similar approach. Create a streaming table that ingests data based on order IDs, and again use APPLY CHANGES INTO to merge the changes.
When new data arrives, DLT will handle inserts, updates, and deletes. Records with order IDs that no longer exist will be deleted from the target table.
Example:
-- Ingest the per-ID change records as a streaming table
CREATE OR REFRESH STREAMING TABLE my_streaming_table_by_id
AS SELECT * FROM STREAM read_files(
  '/path/to/parquet/files',
  format => 'parquet'
);

-- Merge into the target by order ID
APPLY CHANGES INTO my_target_table
FROM STREAM(my_streaming_table_by_id)
KEYS (order_id)             -- assumed key column
SEQUENCE BY load_timestamp; -- assumed ordering column
Unique Keys:
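APPLY CHANGES INTO requires a unique key for its KEYS clause. If your source rows have no natural key, you can derive a synthetic one by hashing the columns that together identify a row. A minimal sketch, with illustrative column names:
-- Sketch: deriving a synthetic key (order_id and line_number are examples)
CREATE OR REFRESH STREAMING TABLE orderlines_with_key
AS SELECT
  sha2(concat_ws('||', order_id, line_number), 256) AS synthetic_key,
  *
FROM STREAM(my_streaming_table);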
Refreshing Materialized Views:
Materialized views in DLT are powerful for aggregations and transformations. You can refresh them explicitly using the REFRESH MATERIALIZED VIEW statement.
For example:
-- Create a materialized view
CREATE MATERIALIZED VIEW mv1
AS SELECT order_id, SUM(amount) AS total_amount
FROM my_target_table
GROUP BY order_id;
-- Refresh the materialized view
REFRESH MATERIALIZED VIEW mv1;
DLT will automatically update the materialized view as new data arrives.
In summary, DLT provides the flexibility and power to handle incremental loads, deletes, and materialized views. While unique keys are essential, you can work around a missing natural key by creating a synthetic one. Good luck with your PoC, and feel free to ask any further questions!
02-14-2024 02:33 AM
Hi @Kaniz_Fatma
First of all, thanks for your comprehensive answer. Although I appreciate your input, I'm missing an answer that addresses my problem: I don't want to delete all data from the previous load, only the data that 'overlaps' my new load:
Case 1
In this case, I load data from the source system using a 'sliding' date window: I fetch all rows from (say) the last month, because rows can change or be removed within that month. So I want to keep everything already loaded prior to the last month, and delete everything from the last month onward (because those rows are in my new load).
Case 2
Same for this case: I don't want to delete the older IDs that no longer appear. They still exist; they are just not in my incremental load anymore, because I'm only fetching the order lines for the IDs that changed. So it must keep all IDs from the prior load, and delete (then replace) only the IDs that are in the new incremental load.
How can DLT handle these deletes? I can't see how a synthetic key could help in these cases.
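To make both cases concrete, here is roughly the batch logic I'm trying to reproduce (table and column names are just examples):
-- Case 1: replace only the reloaded date window
DELETE FROM my_target_table
WHERE order_date >= date_sub(current_date(), 30);
INSERT INTO my_target_table
SELECT * FROM new_load
WHERE order_date >= date_sub(current_date(), 30);
-- Case 2: replace only the order IDs present in the new load
DELETE FROM my_target_table
WHERE order_id IN (SELECT order_id FROM new_load);
INSERT INTO my_target_table
SELECT * FROM new_load;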
Do you understand my problem? If not, I can prepare some fuller samples.
03-11-2024 01:07 PM
Does anyone have an idea?
@Kaniz_Fatma Could you tell me if it's possible?
06-15-2024 05:09 AM
@_TJ, did you find a solution for the sliding window?