07-25-2025 08:29 PM
I have deep cloned a table and then ran an update, but the update's commit timestamp is earlier than the deep clone timestamp at version 0.
It looks like there is an issue with the deep clone.
timecard_transaction_id | _change_type | _commit_version | _commit_timestamp |
214920856 | update_preimage | 1 | 2025-07-26T02:27:25+10:00 |
214920856 | update_postimage | 1 | 2025-07-26T02:27:25+10:00 |
214920856 | insert | 0 | 2025-07-26T09:27:33+10:00 |
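For context, a minimal sketch of the kind of change-feed query that produces output like the above; the table name and starting version here are placeholders, not the actual values:

```python
# Minimal sketch: query the Change Data Feed of the cloned table.
# `spark` is the SparkSession provided by the Databricks runtime;
# the table name and starting version are placeholders.
cdf = spark.sql("""
    SELECT timecard_transaction_id, _change_type, _commit_version, _commit_timestamp
    FROM table_changes('my_catalog.my_schema.timecard_clone', 0)
    ORDER BY _commit_timestamp DESC
""")
cdf.show(truncate=False)
```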
07-26-2025 11:09 AM - edited 07-26-2025 11:13 AM
Hi @sugunk
You're right to be confused: this behavior doesn't look quite right at first. Let's break it down and see what's really happening.
This is not a bug in Delta Lake but rather a quirk of how commit_timestamp works:
commit_timestamp reflects the wall-clock time at the source cluster/node performing the commit, not necessarily in version order.
So:
- Deep Clone (Version 0): When you cloned the table, the clone commit was stamped 2025-07-26 09:27:33+10:00.
- Update (Version 1): Happened later logically (in terms of Delta version), but the commit was stamped by a node with an earlier system time, or with clock skew on your cluster (hence 2025-07-26 02:27:25+10:00); see the history check sketched below.
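A quick way to see this is to compare commit versions with their recorded timestamps (a minimal sketch; the table name is a placeholder):

```python
# Minimal sketch: list Delta versions next to their recorded commit timestamps.
# The table name is a placeholder for the cloned table.
history = spark.sql("DESCRIBE HISTORY my_catalog.my_schema.timecard_clone")
history.select("version", "timestamp", "operation").orderBy("version").show(truncate=False)
```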
Key Points to Know
1. commit_version is always consistent and monotonically increasing.
- Trust versioning for lineage and time travel, not commit_timestamp.
2. commit_timestamp is not guaranteed to be monotonic, especially across clusters, jobs, or time zones.
3. This behavior is documented (though subtly) in Delta Lake specs:
"commit timestamps are not strictly ordered and should not be used as a proxy for Delta transaction order."
Delta currently relies on the file modification time to identify the timestamp of a commit...
this can easily change when files are copied or moved... The possibility of
non-monotonic file timestamps also adds lots of code complexity...
https://github.com/delta-io/delta/issues/2532
Recommendations
- When tracking history, always use commit_version for the order of changes (see the sketch after this list).
- If consistency of commit_timestamp is required (e.g., for audits), ensure cluster time synchronization via NTP.
- You can enrich your data with a column such as event_timestamp to track true event times, independent of commit metadata.
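A minimal sketch of a version-ordered CDF read, assuming a placeholder table name (the event_timestamp idea is noted in the comments):

```python
from pyspark.sql import functions as F

# Minimal sketch: read the change feed and order by _commit_version,
# not _commit_timestamp. Table name and starting version are placeholders.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("my_catalog.my_schema.timecard_clone")
)

ordered = changes.orderBy(F.col("_commit_version"), F.col("timecard_transaction_id"))
ordered.show(truncate=False)

# When writing new rows, you could also stamp a true event time into the data,
# e.g. .withColumn("event_timestamp", F.current_timestamp()), so audits never
# depend on _commit_timestamp.
```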
07-26-2025 02:26 PM
Do we need to do time synchronization in serverless? We are planning to move from job clusters to serverless, and since we use commit_timestamp for CDF this could cause issues.
Thanks,
sugun
07-26-2025 02:50 PM
Yes, this is an important thing to think about if you're switching to Databricks Serverless, especially if your data pipelines use commit_timestamp from Change Data Feed (CDF) to track or filter changes.
In serverless, you can't control or guarantee the exact system time on the machines running your jobs. So if you're using commit_timestamp to decide which data is new or has changed, it might not always be accurate or in the correct order. This could cause your pipeline to miss or duplicate changes.
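One way to stay safe on serverless (a minimal sketch; the table name and the checkpointing of the last processed version are assumptions) is to drive incremental CDF reads by commit version instead of timestamp:

```python
from pyspark.sql import functions as F

# Minimal sketch: incremental CDF read driven by commit version, not timestamp.
# `last_processed_version` would come from your own checkpoint/state store;
# the table name is a placeholder.
last_processed_version = 0

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_processed_version + 1)
    .table("my_catalog.my_schema.timecard_clone")
)

if changes.count() > 0:
    # Apply the changes, then persist the new high-water mark for the next run.
    new_high_watermark = changes.agg(F.max("_commit_version")).first()[0]
```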