Data Engineering
APPLY_CHANGES late arriving data

Gilg
Contributor II

Hi Team,

I have a DLT pipeline that uses APPLY_CHANGES to our Silver tables. 

I am using Id as keys and timestamp to know the sequence of the incoming data. 

Question: How does APPLY_CHANGES handle late arriving data?

For example: for silver_table_1, a record for Id 1 arrives on 08/10/2023 at 10:30pm and is inserted into silver_table_1. The next day, another record for the same Id 1 arrives, but its timestamp is 07/10/2023 at 08:00pm, i.e. earlier than the timestamp of the record already inserted.

Cheers,

G

3 REPLIES

Kaniz
Community Manager

Hi @Gilg, the APPLY_CHANGES function in Databricks Delta Live Tables handles late-arriving data using the specified SEQUENCE BY column, which in your case is the timestamp. It uses this column to propagate appropriate sequencing values to the __START_AT and __END_AT columns of the target table, which keeps records correctly ordered even when data arrives late.

In your scenario, where data for the same Id arrives with a timestamp earlier than that of the already-inserted record, APPLY_CHANGES applies the sequencing values derived from the timestamp and adjusts the __START_AT and __END_AT values of the affected records so that the correct sequence is maintained.

However, it's important to note that you must ensure that your data manipulation language (DML) statements use valid values for these columns to maintain the proper ordering of records.
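To make the Type 2 sequencing concrete, here is a small pure-Python sketch (a toy model, not the actual DLT implementation; the helper name and row shape are made up for illustration) of how ordering rows by a SEQUENCE BY column yields __START_AT/__END_AT values that stay correct even when an older record arrives after a newer one:

```python
def scd2_rows(changes, sequence_by="ts"):
    """Toy model of SCD Type 2 sequencing: order rows by the SEQUENCE BY
    column, then derive __START_AT from each row's own sequence value and
    __END_AT from the next row's (None marks the current version)."""
    ordered = sorted(changes, key=lambda r: r[sequence_by])
    out = []
    for i, row in enumerate(ordered):
        end = ordered[i + 1][sequence_by] if i + 1 < len(ordered) else None
        out.append({**row, "__START_AT": row[sequence_by], "__END_AT": end})
    return out

# Arrival order mirrors the question: the 08/10 record lands first,
# and the 07/10 record shows up a day later.
changes = [
    {"Id": 1, "value": "b", "ts": "2023-10-08T22:30"},
    {"Id": 1, "value": "a", "ts": "2023-10-07T20:00"},  # late arrival
]
for row in scd2_rows(changes):
    print(row["value"], row["__START_AT"], row["__END_AT"])
# The late row is slotted in first and closed off by the newer one:
# a 2023-10-07T20:00 2023-10-08T22:30
# b 2023-10-08T22:30 None
```

So even though the older record physically arrived second, the history reads in the correct business order.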

Gilg
Contributor II

Hi Kaniz,

Will this work on an SCD Type 1 table? As I understand it, the __START_AT and __END_AT columns are only available with SCD Type 2 APPLY_CHANGES. When I run a quick SELECT on the table from a SQL Warehouse, I do not see these columns in a Type 1 table.

Cheers,

G

Kaniz
Community Manager

Hi @Gilg, you are correct.

The __START_AT and __END_AT columns are specific to SCD Type 2 tables in Databricks. They are used to maintain the history of data changes, which is the defining characteristic of SCD Type 2. In contrast, SCD Type 1 tables do not keep a history of changes; they simply overwrite the existing record with the new data, so they neither require nor contain the __START_AT and __END_AT columns.

Here is the relevant information from the sources provided:

- "Because [APPLY_CHANGES] uses a specified SEQUENCE BY column and propagates appropriate sequencing values to the __START_AT and __END_AT columns of the target table (for SCD type 2), you must ensure that DML statements use valid values for these columns to maintain the proper ordering of records."
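The Type 1 contrast can be sketched with a small toy model in Python (an illustration of the overwrite-latest-wins semantics, not the actual DLT implementation; the helper name and row shape are made up): with SCD Type 1 there is no history, so a change only wins if its sequence value is at least as new as what is already stored, and the stale record from the original question is simply discarded:

```python
def apply_changes_scd1(target, changes, key="Id", sequence_by="ts"):
    """Toy model of SCD Type 1 with SEQUENCE BY: keep only the latest
    version per key; changes with an older sequence value are ignored."""
    for row in changes:
        current = target.get(row[key])
        if current is None or row[sequence_by] >= current[sequence_by]:
            target[row[key]] = row
    return target

table = {}
apply_changes_scd1(table, [{"Id": 1, "value": "new", "ts": "2023-10-08T22:30"}])
# The late-arriving record carries an older timestamp ...
apply_changes_scd1(table, [{"Id": 1, "value": "stale", "ts": "2023-10-07T20:00"}])
print(table[1]["value"])  # ... so it does not overwrite: prints "new"
```

Either way, the table ends up holding the newest version per key; the difference is only that Type 1 never records the superseded versions.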


- "SCD type 2 updates will add a history row for every input row, even if no columns have changed."

These statements confirm that the __START_AT and __END_AT columns are specific to SCD Type 2 tables and are used to maintain the history of data changes.
