Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Incorrect commit timestamp after deep clone

sugunk
New Contributor II

I have deep cloned a table and then run an update, but the update's commit timestamp is earlier than the deep clone's timestamp at version 0.
It looks like there is an issue with the deep clone.

 
Here is the output; the _commit_timestamp order is not in sync with _commit_version:

timecard_transaction_id | _change_type     | _commit_version | _commit_timestamp
214920856               | update_preimage  | 1               | 2025-07-26T02:27:25+10:00
214920856               | update_postimage | 1               | 2025-07-26T02:27:25+10:00
214920856               | insert           | 0               | 2025-07-26T09:27:33+10:00
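
For context, this is the kind of CDF query that produces output like the above; a minimal sketch, where the table name timecard_clone and the starting version are assumptions rather than details from the post (spark is the ambient SparkSession in a Databricks notebook):

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")   # requires delta.enableChangeDataFeed = true on the table
    .option("startingVersion", 0)       # read every change since the clone created version 0
    .table("timecard_clone")            # assumed name for the deep-cloned table
    .select("timecard_transaction_id", "_change_type",
            "_commit_version", "_commit_timestamp")
)
changes.orderBy("_commit_version").show(truncate=False)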

3 REPLIES

lingareddy_Alva
Honored Contributor III

Hi @sugunk 

You're right to be confused; this behavior doesn't look quite right at first. Let's break it down and see what's really happening.
This is not a bug in Delta Lake but rather a quirk of how commit_timestamp works:
commit_timestamp reflects the wall-clock time at the source cluster/node performing the commit, not necessarily in version order.

So:
- Deep Clone (Version 0): when you cloned the table, the commit was stamped 2025-07-26 09:27:33+10:00.
- Update (Version 1): happened later logically (it carries the higher Delta version), but the commit was stamped by a node with an earlier system time, or by a cluster with clock skew (hence 2025-07-26 02:27:25+10:00).
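
You can see the same skew directly in the table history; a quick sketch, again assuming the table is named timecard_clone:

# Compare the monotonic version numbers with the recorded wall-clock timestamps.
history = spark.sql("DESCRIBE HISTORY timecard_clone")
(history
    .select("version", "timestamp", "operation")
    .orderBy("version")
    .show(truncate=False))

If the timestamp column does not increase with the version column, you are looking at exactly the clock-skew effect described above.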

Key Points to Know
1. commit_version is always consistent and monotonically increasing.
- Trust versioning for lineage and time travel, not commit_timestamp (see the sketch after this list).

2. commit_timestamp is not guaranteed to be monotonic, especially across clusters, jobs, or time zones.

3. This behavior is acknowledged (though subtly) in the Delta Lake project itself:
"commit timestamps are not strictly ordered and should not be used as a proxy for Delta transaction order."
"Delta currently relies on the file modification time to identify the timestamp of a commit... this can easily change when files are copied or moved... The possibility of non-monotonic file timestamps also adds lots of code complexity."
https://github.com/delta-io/delta/issues/2532
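
To make point 1 concrete, here is a short sketch of time travel pinned to a version rather than a timestamp (the table name is again an assumption):

# Version-based time travel is deterministic: version 0 is the clone's initial state.
snapshot_v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .table("timecard_clone")
)

# Timestamp-based time travel also exists, but it resolves against commit
# timestamps, so clock skew like the above can send you to the wrong version:
# spark.read.format("delta").option("timestampAsOf", "2025-07-26 02:27:25").table("timecard_clone")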

Recommendations
- When tracking history, always use commit_version to order changes (see the sketch after this list).
- If consistency of commit_timestamp is required (e.g., for audits), ensure cluster time synchronization via NTP.
- Carry a column like event_timestamp inside the data itself to track true event times.
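
Putting the first and third recommendations together, a hedged sketch (table and column names are illustrative):

from pyspark.sql import functions as F

# Order change feed rows by the monotonically increasing version, never by _commit_timestamp.
ordered_changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("timecard_clone")
    .orderBy("_commit_version")
)

# When writing the source data, stamp a true event time inside the rows themselves,
# so downstream consumers never have to lean on commit metadata (source_df is hypothetical):
enriched = source_df.withColumn("event_timestamp", F.current_timestamp())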

 

LR

sugunk
New Contributor II

Do we need to do time synchronization in serverless? We are planning to move from job clusters to serverless, and since we use commit_timestamp for CDF, this could cause issues.

thanks

sugun

 

lingareddy_Alva
Honored Contributor III

Yes, this is an important thing to think about if you're switching to Databricks Serverless, especially if your data pipelines use commit_timestamp from Change Data Feed (CDF) to track or filter changes.

In serverless, you can't control or guarantee the exact system time on the machines running your jobs. So if you're using commit_timestamp to decide which data is new or has changed, it might not always be accurate or in the correct order. This could cause your pipeline to miss or duplicate changes.
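
If you do move to serverless, one way to sidestep this is to checkpoint by version rather than by timestamp. A minimal sketch; the checkpoint table cdf_checkpoint and its last_version column are assumptions:

from pyspark.sql import functions as F

# Look up the last version this pipeline processed (None on the first run).
row = spark.table("cdf_checkpoint").agg(F.max("last_version")).first()
last_version = row[0] if row[0] is not None else -1

new_changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_version + 1)  # version cursor, immune to clock skew
    .table("timecard_clone")                      # assumed table name from earlier in the thread
)

# After processing, persist max(_commit_version) from new_changes back to
# cdf_checkpoint so the next run resumes exactly where this one stopped.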

 

LR