cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Does Lakeflow Connect guarantee no out-of-order records?

yit337
Contributor

I use Lakeflow Connect to load data from my source databases to bronze tables. Then I have auto_cdc to track SCD2 changes in my silver tables. I use _commit_timestamp from the bronze CDF, as sequence_by property in auto_cdc in order to order the version of records. Is that enough, or shall I include business timestamp column to handle out-of-order events?

1 ACCEPTED SOLUTION

Accepted Solutions

Lu_Wang_ENB_DBX
Databricks Employee
Databricks Employee

Recommendation: use a business/effective timestamp in sequence_by if your source can emit late/backdated changes and you want SCD2 history to reflect source event time, not bronze arrival/commit time. If ties are possible, use a STRUCT for deterministic ordering, e.g. STRUCT(business_ts, _commit_timestamp). AUTO CDC uses SEQUENCE BY as the logical order of CDC events, handles out-of-order arrivals, and supports multi-column sequencing via STRUCT.

Options

  1. Keep _commit_timestamp only โ€” good if bronze commit order is already the business order you want. Databricks docs do show this pattern for Delta CDF examples.
  2. Use business timestamp only โ€” best if source events can arrive late and must be ordered by source/effective time, not ingestion time. AUTO CDC docs describe SEQUENCE BY as the logical order from the source data.
  3. Use STRUCT(business_ts, _commit_timestamp) โ€” best balance in practice: business time decides history, commit time breaks ties deterministically.

Summary: _commit_timestamp alone is enough when โ€œarrival order = desired history order.โ€ Otherwise include business timestamp.

View solution in original post

2 REPLIES 2

pradeep_singh
Contributor III

To process late arriving data correctly you would need a business column that identifies when the data was created/updated at the source and have that as the first column in your struct of sequence_by columns . 

A right sequence_by column/columns must be a monotonically increasing representation of the correct event order, with one distinct update per key at each sequencing value ._commit_timestamp alone is only enough if bronze commit time is the exact event order you want to preserve.

Thank You
Pradeep Singh - https://www.linkedin.com/in/dbxdev

Lu_Wang_ENB_DBX
Databricks Employee
Databricks Employee

Recommendation: use a business/effective timestamp in sequence_by if your source can emit late/backdated changes and you want SCD2 history to reflect source event time, not bronze arrival/commit time. If ties are possible, use a STRUCT for deterministic ordering, e.g. STRUCT(business_ts, _commit_timestamp). AUTO CDC uses SEQUENCE BY as the logical order of CDC events, handles out-of-order arrivals, and supports multi-column sequencing via STRUCT.

Options

  1. Keep _commit_timestamp only โ€” good if bronze commit order is already the business order you want. Databricks docs do show this pattern for Delta CDF examples.
  2. Use business timestamp only โ€” best if source events can arrive late and must be ordered by source/effective time, not ingestion time. AUTO CDC docs describe SEQUENCE BY as the logical order from the source data.
  3. Use STRUCT(business_ts, _commit_timestamp) โ€” best balance in practice: business time decides history, commit time breaks ties deterministically.

Summary: _commit_timestamp alone is enough when โ€œarrival order = desired history order.โ€ Otherwise include business timestamp.