07-25-2025 08:29 PM
I have deep cloned a table and then ran an update, but the update's commit timestamp is earlier than the deep clone timestamp at version 0.
It looks like there is an issue with the deep clone.
timecard_transaction_id | _change_type | _commit_version | _commit_timestamp |
214920856 | update_preimage | 1 | 2025-07-26T02:27:25+10:00 |
214920856 | update_postimage | 1 | 2025-07-26T02:27:25+10:00 |
214920856 | insert | 0 | 2025-07-26T09:27:33+10:00 |
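For context, a minimal sketch of the kind of change-feed query that produces output like the above; the table name and starting version here are placeholders, not the actual values:

```python
# Minimal sketch: query the Change Data Feed of the cloned table.
# `spark` is the SparkSession provided by the Databricks runtime;
# the table name and starting version are placeholders.
cdf = spark.sql("""
    SELECT timecard_transaction_id, _change_type, _commit_version, _commit_timestamp
    FROM table_changes('my_catalog.my_schema.timecard_clone', 0)
    ORDER BY _commit_timestamp DESC
""")
cdf.show(truncate=False)
```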
07-26-2025 11:09 AM - edited 07-26-2025 11:13 AM
Hi @sugunk
You're right to be confused: this behavior doesn't look quite right at first. Let's break it down and see what's really happening.
This is not a bug in Delta Lake but rather a quirk of how commit_timestamp works:
commit_timestamp reflects the wall-clock time at the source cluster/node performing the commit, not necessarily in version order.
So:
- Deep Clone (Version 0): When you cloned the table, the clone commit was stamped 2025-07-26 09:27:33+10:00.
- Update (Version 1): Happened later logically (in terms of Delta version), but the commit was stamped by a node with an earlier system time, or with clock skew on your cluster (hence 2025-07-26 02:27:25+10:00); see the history check sketched below.
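A quick way to see this is to compare commit versions with their recorded timestamps (a minimal sketch; the table name is a placeholder):

```python
# Minimal sketch: list Delta versions next to their recorded commit timestamps.
# The table name is a placeholder for the cloned table.
history = spark.sql("DESCRIBE HISTORY my_catalog.my_schema.timecard_clone")
history.select("version", "timestamp", "operation").orderBy("version").show(truncate=False)
```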
Key Points to Know
1. commit_version is always consistent and monotonically increasing.
- Trust versioning for lineage and time travel, not commit_timestamp.
2. commit_timestamp is not guaranteed to be monotonic, especially across clusters, jobs, or time zones.
3. This behavior is documented (though subtly) in Delta Lake specs:
"commit timestamps are not strictly ordered and should not be used as a proxy for Delta transaction order."
Delta currently relies on the file modification time to identify the timestamp of a commit...
this can easily change when files are copied or moved... The possibility of
non-monotonic file timestamps also adds lots of code complexity...
https://github.com/delta-io/delta/issues/2532
Recommendations
- When tracking history, always use commit_version for the order of changes (see the sketch after this list).
- If consistency of commit_timestamp is required (e.g., for audits), ensure cluster time synchronization via NTP.
- You can enrich your data with a column such as event_timestamp to track true event times, independent of commit metadata.
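A minimal sketch of a version-ordered CDF read, assuming a placeholder table name (the event_timestamp idea is noted in the comments):

```python
from pyspark.sql import functions as F

# Minimal sketch: read the change feed and order by _commit_version,
# not _commit_timestamp. Table name and starting version are placeholders.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("my_catalog.my_schema.timecard_clone")
)

ordered = changes.orderBy(F.col("_commit_version"), F.col("timecard_transaction_id"))
ordered.show(truncate=False)

# When writing new rows, you could also stamp a true event time into the data,
# e.g. .withColumn("event_timestamp", F.current_timestamp()), so audits never
# depend on _commit_timestamp.
```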
07-26-2025 02:26 PM
Do we need to do time synchronization in serverless? We are planning to move from job clusters to serverless, and since we use commit_timestamp for CDF this could cause issues.
Thanks,
sugun
07-26-2025 02:50 PM
Yes, this is an important thing to think about if you're switching to Databricks Serverless, especially if your data pipelines use commit_timestamp from Change Data Feed (CDF) to track or filter changes.
In serverless, you can't control or guarantee the exact system time on the machines running your jobs. So if you're using commit_timestamp to decide which data is new or has changed, it might not always be accurate or in the correct order. This could cause your pipeline to miss or duplicate changes.
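One way to stay safe on serverless (a minimal sketch; the table name and the checkpointing of the last processed version are assumptions) is to drive incremental CDF reads by commit version instead of timestamp:

```python
from pyspark.sql import functions as F

# Minimal sketch: incremental CDF read driven by commit version, not timestamp.
# `last_processed_version` would come from your own checkpoint/state store;
# the table name is a placeholder.
last_processed_version = 0

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_processed_version + 1)
    .table("my_catalog.my_schema.timecard_clone")
)

if changes.count() > 0:
    # Apply the changes, then persist the new high-water mark for the next run.
    new_high_watermark = changes.agg(F.max("_commit_version")).first()[0]
```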