Monday
Hi,
I'd like to know the best, most reliable way to discover when the data in a Unity Catalog table was last modified (meaning rows were added or deleted). Either Python or SQL, doesn't matter... as long as I can use it within a scheduled job and run it periodically.
Any hint highly appreciated
Monday
Greetings @GeKo
Good question.
Short answer: treat the Delta transaction log as your source of truth. Every write, delete, or merge on a Unity Catalog table creates a commit with a timestamp and operation type. DESCRIBE HISTORY gives you access to all of it.
The key move is filtering to only data-changing operations so you're not picking up maintenance noise like OPTIMIZE or VACUUM:
SELECT
  max(timestamp) AS last_data_change_ts
FROM (
  DESCRIBE HISTORY catalog_name.schema_name.table_name
)
WHERE operation IN ('WRITE', 'DELETE', 'MERGE', 'UPDATE');
You can broaden or tighten that operation filter depending on what counts as a "data change" in your environment. RESTORE, TRUNCATE, and CREATE TABLE AS SELECT are also valid candidates depending on your use case.
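As a sketch, a broader variant might look like the following; treat the exact list as an assumption to tune for your environment ('STREAMING UPDATE' is included here on the guess that streaming writes also count as data changes for you):

```sql
SELECT
  max(timestamp) AS last_data_change_ts
FROM (
  DESCRIBE HISTORY catalog_name.schema_name.table_name
)
WHERE operation IN ('WRITE', 'DELETE', 'MERGE', 'UPDATE',
                    'RESTORE', 'TRUNCATE', 'CREATE TABLE AS SELECT',
                    'STREAMING UPDATE');
```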
Python equivalent, ready for a scheduled job:
table_name = "catalog_name.schema_name.table_name"
df = spark.sql(f"""
    SELECT max(timestamp) AS last_data_change_ts
    FROM (DESCRIBE HISTORY {table_name})
    WHERE operation IN ('WRITE', 'DELETE', 'MERGE', 'UPDATE')
""")
last_ts = df.collect()[0]["last_data_change_ts"]
print(f"Last data change for {table_name}: {last_ts}")
From there you can write last_ts into a monitoring table, compare it to current time, or trigger an alert if the table hasn't been updated in a while. This only reads the transaction log, not the table data, so it's lightweight.
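As a sketch of that alerting step (pure Python, no Spark needed): this assumes last_data_change_ts comes back as a timezone-aware datetime, and the 24-hour threshold is a made-up freshness SLA to adjust for your tables:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_change_ts, max_age_hours=24):
    """Return True if the last data change is older than the freshness threshold.

    last_change_ts: timezone-aware datetime pulled from DESCRIBE HISTORY.
    max_age_hours: hypothetical SLA; tune per table.
    """
    age = datetime.now(timezone.utc) - last_change_ts
    return age > timedelta(hours=max_age_hours)

# A change 3 hours ago is fresh under a 24h SLA
recent = datetime.now(timezone.utc) - timedelta(hours=3)
print(is_stale(recent))  # False
```

In a scheduled job you'd feed last_ts from the query above into is_stale() and send a notification when it returns True.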
If you don't need to distinguish between operation types and just want a quick timestamp, DESCRIBE DETAIL is a simpler option:
DESCRIBE DETAIL catalog_name.schema_name.table_name
That gives you a lastModified column reflecting the most recent modification. Less granular, but gets the job done for simpler checks.
One gotcha worth flagging: don't rely on the last_altered column in system.information_schema.tables. It only tracks DDL changes like schema modifications and table properties. It does not reflect DML operations (inserts, updates, deletes). If someone writes new rows to your table, last_altered won't move. This bites people regularly.
Hope that helps, let us know how it goes.
Cheers, Lou
Monday - last edited Monday
I think it may be a trick question: there isn't necessarily a single, always-available best way, since Unity Catalog tables span a diversity of object types, configurations, and load patterns. Which methods are possible, or best, depends on what metadata is (or can be made) available for the object.
Method 1: Check the _metadata.file_modification_time attribute of the object (if it exists; add it if needed/possible)
Some, but not all, tables/objects expose a _metadata column:
SELECT
  'catalog.schema.your_table' AS object_name,
  MAX(_metadata.file_modification_time) AS max_ts
FROM
  catalog.schema.your_table
WHERE
  _metadata.file_modification_time >= CURRENT_TIMESTAMP() - INTERVAL 7 DAYS
Method 2: Consume DESCRIBE HISTORY results for the object
For some tables, history can be found by inspecting DESCRIBE HISTORY. (Questions to consider: Is there a slight lag before HISTORY is updated? Do all objects necessarily have HISTORY?) Skeleton:
SELECT
  'catalog.schema.your_table' AS object_name,
  MAX(timestamp) AS max_ts
FROM
  (DESCRIBE HISTORY catalog.schema.your_table);
Method 3: Check the last time of a successful job, for the job that loads that object (if you know the timing)
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import RunResultState

w = WorkspaceClient()
# 1. Based on the job id that you know, list its completed runs
runs = w.jobs.list_runs(job_id=your_job_id, completed_only=True)
# 2. Find the job run with the latest successful state
successes = [r for r in runs if r.state and r.state.result_state == RunResultState.SUCCESS]
latest = max(successes, key=lambda r: r.end_time)
# 3. latest.end_time (epoch ms) is when the last job that loaded the table finished
# 4. The data is no fresher than that, but should be current as of that run
Method 4: Keep your own watermark table
A full solution outline is omitted for brevity, since building your own metadata table is not a great answer to "where do I look it up", but the idea is simple: have the load job write a timestamp when it finishes. This has its own pros and cons, and is probably not what the OP was asking for.
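For what it's worth, a minimal sketch of that final write step; the ops.load_watermarks table and all names in it are purely illustrative:

```sql
-- Hypothetical watermark table (names are illustrative)
CREATE TABLE IF NOT EXISTS ops.load_watermarks (
  table_name     STRING,
  last_loaded_ts TIMESTAMP
);

-- Last step of the load job: upsert the completion time
MERGE INTO ops.load_watermarks AS t
USING (
  SELECT 'catalog.schema.your_table' AS table_name,
         current_timestamp()         AS last_loaded_ts
) AS s
ON t.table_name = s.table_name
WHEN MATCHED THEN UPDATE SET t.last_loaded_ts = s.last_loaded_ts
WHEN NOT MATCHED THEN INSERT *;
```

Freshness checks then become a simple SELECT against the watermark table.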
yesterday
Many thanks for answering @SFDataEng and @Louis_Frolio
I'll try the recommendation from Lou first... then approach methods 3 and 4 from SFDataEng, if required.