<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: DLT - Handling Merge in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/dlt-handling-merge/m-p/109193#M43237</link>
<description>&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;SPAN&gt;To address the challenges in your Delta Live Tables (DLT) pipeline, here are some steps to help you read the Apply Changes table incrementally and join it with the streaming live table for SCD Type 1 processing:&lt;/SPAN&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;SPAN&gt;&lt;STRONG&gt;Incremental Data Reading from Apply Changes Table&lt;/STRONG&gt;:&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;&lt;SPAN&gt;Ensure that the Apply Changes table is set up to capture changes using the &lt;CODE&gt;APPLY CHANGES&lt;/CODE&gt; API. This API is designed to handle change data capture (CDC) efficiently.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Use the &lt;CODE&gt;apply_changes()&lt;/CODE&gt; function in Python to specify the source, keys, and sequencing for the change feed. This function will help you process changes incrementally.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Handling Out-of-Order Data&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;&lt;SPAN&gt;The &lt;CODE&gt;APPLY CHANGES&lt;/CODE&gt; API handles out-of-sequence records automatically, ensuring correct processing of CDC records. Specify a sequencing column in the source data (for example, a timestamp or log sequence number); Delta Live Tables interprets it as a monotonically increasing representation of the proper ordering of the source data.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Joining Tables with Time Lag&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;To manage the time lag between the two sources, consider using a watermark to handle late data. This can help ensure that you do not lose records during the join operation.&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Use the &lt;CODE&gt;apply_changes()&lt;/CODE&gt; function to maintain the SCD Type 1 target, and perform the join on the streaming inputs upstream of it. Note that the target of &lt;CODE&gt;APPLY CHANGES&lt;/CODE&gt; receives updates and deletes, so it typically cannot be consumed downstream as an append-only streaming source; join against the upstream change feed instead, or read the target as a complete (non-streaming) input.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;</description>
    <pubDate>Thu, 06 Feb 2025 13:29:25 GMT</pubDate>
    <dc:creator>Walter_C</dc:creator>
    <dc:date>2025-02-06T13:29:25Z</dc:date>
    <item>
      <title>DLT - Handling Merge</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-handling-merge/m-p/109188#M43235</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;In our DLT pipeline we read two tables: an Apply Changes Delta table and a streaming live table. We can read the latest records from the streaming live table incrementally, but we cannot read incremental data from the Apply Changes table. We now need to join these two tables in real time to perform SCD Type 1 on our target DLT table.&lt;BR /&gt;Challenges: although both sources share common records, they can arrive with a time lag, so a real-time inner join loses records, and we are unable to read the Apply Changes table incrementally.&lt;BR /&gt;How can we handle this situation so that records from both sources are inserted/updated into our target DLT table without losing records?&lt;/P&gt;</description>
      <pubDate>Thu, 06 Feb 2025 13:11:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-handling-merge/m-p/109188#M43235</guid>
      <dc:creator>JothyGanesan</dc:creator>
      <dc:date>2025-02-06T13:11:52Z</dc:date>
    </item>
    <item>
      <title>Re: DLT - Handling Merge</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-handling-merge/m-p/109193#M43237</link>
      <description>&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;SPAN&gt;To address the challenges in your Delta Live Tables (DLT) pipeline, here are some steps to help you read the Apply Changes table incrementally and join it with the streaming live table for SCD Type 1 processing:&lt;/SPAN&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;SPAN&gt;&lt;STRONG&gt;Incremental Data Reading from Apply Changes Table&lt;/STRONG&gt;:&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;&lt;SPAN&gt;Ensure that the Apply Changes table is set up to capture changes using the &lt;CODE&gt;APPLY CHANGES&lt;/CODE&gt; API. This API is designed to handle change data capture (CDC) efficiently.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Use the &lt;CODE&gt;apply_changes()&lt;/CODE&gt; function in Python to specify the source, keys, and sequencing for the change feed. This function will help you process changes incrementally.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Handling Out-of-Order Data&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;&lt;SPAN&gt;The &lt;CODE&gt;APPLY CHANGES&lt;/CODE&gt; API handles out-of-sequence records automatically, ensuring correct processing of CDC records. Specify a sequencing column in the source data (for example, a timestamp or log sequence number); Delta Live Tables interprets it as a monotonically increasing representation of the proper ordering of the source data.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Joining Tables with Time Lag&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;To manage the time lag between the two sources, consider using a watermark to handle late data. This can help ensure that you do not lose records during the join operation.&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Use the &lt;CODE&gt;apply_changes()&lt;/CODE&gt; function to maintain the SCD Type 1 target, and perform the join on the streaming inputs upstream of it. Note that the target of &lt;CODE&gt;APPLY CHANGES&lt;/CODE&gt; receives updates and deletes, so it typically cannot be consumed downstream as an append-only streaming source; join against the upstream change feed instead, or read the target as a complete (non-streaming) input.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;</description>
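The steps above could be sketched as a DLT pipeline in Python. This is a minimal sketch under stated assumptions, not a drop-in solution: the table names (customers_cdc_feed, orders_stream), the key column customer_id, the sequencing columns event_ts and order_ts, and the 30-minute watermark interval are all illustrative and would need to match your own schema. It requires a Databricks DLT runtime to execute.

```python
# Hypothetical sketch: table names, keys, and timestamp columns are assumptions.
import dlt
from pyspark.sql.functions import col, expr

# 1) Maintain the SCD Type 1 target from the CDC feed. sequence_by tells DLT
#    how to order out-of-sequence records.
dlt.create_streaming_table("customers_current")

dlt.apply_changes(
    target="customers_current",
    source="customers_cdc_feed",     # assumed upstream CDC view/table
    keys=["customer_id"],            # primary key of the dimension
    sequence_by=col("event_ts"),     # monotonically increasing ordering column
    apply_as_deletes=expr("op = 'DELETE'"),  # assumed operation-type column
    stored_as_scd_type=1,
)

# 2) Join the two streaming sources with watermarks so that records arriving
#    with a time lag are held in join state instead of being dropped.
#    Joining the upstream CDC feed (rather than the apply_changes target)
#    keeps both inputs append-only streaming sources.
@dlt.table
def joined_events():
    orders = (
        dlt.read_stream("orders_stream")
        .withWatermark("order_ts", "30 minutes")
    )
    customers = (
        dlt.read_stream("customers_cdc_feed")
        .withWatermark("event_ts", "30 minutes")
    )
    # A plain equi-join works; adding a time-range condition on the two
    # timestamps would additionally let Spark purge join state once the
    # watermark passes.
    return orders.join(customers, "customer_id")
```

The watermark interval trades completeness against state size: a longer interval tolerates a larger lag between the two sources before the inner join drops a record, at the cost of keeping more state.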
      <pubDate>Thu, 06 Feb 2025 13:29:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-handling-merge/m-p/109193#M43237</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2025-02-06T13:29:25Z</dc:date>
    </item>
  </channel>
</rss>

