
APPLY_CHANGES late arriving data

Gilg
Contributor II

Hi Team,

I have a DLT pipeline that uses APPLY_CHANGES to our Silver tables. 

I am using Id as keys and timestamp to know the sequence of the incoming data. 

Question: How does APPLY_CHANGES handle late-arriving data?

i.e., for silver_table_1, data for Id 1 arrives on 08/10/2023 at 10:30pm and gets inserted into silver_table_1. The next day, data for the same Id 1 arrives, but its timestamp is 07/10/2023 at 08:00pm. As you can see, this timestamp is earlier than that of the already-inserted record.

Cheers,

G


Kaniz_Fatma
Community Manager

Hi @Gilg, the APPLY_CHANGES function in Databricks Delta Live Tables handles late-arriving data using the specified SEQUENCE BY column, which in your case is the timestamp. It uses this column to propagate appropriate sequencing values to the __START_AT and __END_AT columns of the target table, which maintains the proper ordering of records even when data arrives late.

In your scenario, where data for the same Id arrives with a timestamp earlier than that of the already-inserted record, APPLY_CHANGES orders the records by their sequencing values rather than by arrival time. It will adjust the __START_AT and __END_AT values of the records to maintain the correct sequence.

However, it's important to note that you must ensure that your data manipulation language (DML) statements use valid values for these columns to maintain the proper ordering of records.
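To make the guarantee concrete, here is a minimal pure-Python sketch (not the DLT engine itself) of what SEQUENCE BY provides for the current state of a key: per Id, the row with the highest sequence value wins, regardless of arrival order. The Id/timestamp column names mirror this thread; the values and helper function are purely illustrative.

```python
# Hedged sketch (assumption: simplified model, not Databricks internals):
# simulating how a SEQUENCE BY column resolves out-of-order CDC events
# for the *current* state of a key.
from datetime import datetime

events = [
    # Arrival order from the thread's scenario: the newer row arrives first,
    # the older row arrives a day later.
    {"Id": 1, "value": "v2", "timestamp": datetime(2023, 10, 8, 22, 30)},
    {"Id": 1, "value": "v1", "timestamp": datetime(2023, 10, 7, 20, 0)},
]

def current_state(events):
    """Keep, per key, the row with the highest sequence value,
    regardless of the order the rows arrived in."""
    state = {}
    for e in events:
        cur = state.get(e["Id"])
        if cur is None or e["timestamp"] > cur["timestamp"]:
            state[e["Id"]] = e
    return state

print(current_state(events)[1]["value"])  # the late, older row does not win
```

So in the scenario above, the late record with the 07/10/2023 timestamp does not overwrite the 08/10/2023 record, because its sequence value is lower.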

Hi Kaniz,

Will this work in an SCD Type 1 table? As I understand it, the __START_AT and __END_AT columns are only available with SCD Type 2 APPLY_CHANGES. When I do a quick SELECT on the table from a SQL Warehouse, I do not see these columns in a Type 1 table.

Cheers,

G

Kaniz_Fatma
Community Manager

Hi @Gilg , You are correct. 

 

The __START_AT and __END_AT columns are specific to SCD Type 2 tables in Databricks. These columns are used to maintain the history of data changes, which is a characteristic of SCD Type 2. In contrast, SCD Type 1 tables do not maintain a history of changes; they simply overwrite the existing record with the new data, so they do not require or contain the __START_AT and __END_AT columns.

Here is the relevant information from the sources provided:

- "Because APPLY CHANGES uses a specified SEQUENCE BY column and propagates appropriate sequencing values to the __START_AT and __END_AT columns of the target table (for SCD type 2), you must ensure that DML statements use valid values for these columns to maintain the proper ordering of records."


- "SCD type 2 updates will add a history row for every input row, even if no columns have changed."

These statements confirm that the __START_AT and __END_AT columns are specific to SCD Type 2 tables and are used to maintain the history of data changes.
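To illustrate the distinction, here is a small pure-Python sketch (again an assumption-laden simplification, not Databricks internals) of how SCD Type 2 derives __START_AT/__END_AT windows from the SEQUENCE BY column. A late-arriving row with an older timestamp becomes a closed history row rather than the current one; an SCD Type 1 table would simply keep the single highest-sequence row and have no such columns.

```python
# Hedged sketch: deriving SCD Type 2 history windows from a SEQUENCE BY
# column. Column names mirror the thread; the logic is illustrative only.
from datetime import datetime

events = [
    # Arrival order: newer row first, older row arrives later.
    {"Id": 1, "value": "v2", "timestamp": datetime(2023, 10, 8, 22, 30)},
    {"Id": 1, "value": "v1", "timestamp": datetime(2023, 10, 7, 20, 0)},
]

def scd2_history(events):
    """Order rows by the sequence column (not arrival order) and give each
    row a validity window; the last row stays open (__END_AT = None)."""
    rows = sorted(events, key=lambda e: e["timestamp"])
    history = []
    for i, e in enumerate(rows):
        nxt = rows[i + 1]["timestamp"] if i + 1 < len(rows) else None
        history.append({**e, "__START_AT": e["timestamp"], "__END_AT": nxt})
    return history

for row in scd2_history(events):
    print(row["value"], row["__START_AT"], row["__END_AT"])
# The late-arriving older row ("v1") gets a closed window ending where
# "v2" starts; "v2" remains the current row with an open __END_AT.
```

This matches the quoted documentation: the engine can slot a late row into the correct position of the history because the windows are computed from the sequence values, not from when the rows happened to arrive.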
