how to reliably get the timestamp of the last write/delete activity on a unity catalog table

GeKo
Contributor

Hi,

I'd like to know the best, most reliable way to discover when the data in a Unity Catalog table was last modified (i.e., rows added or deleted). Either Python or SQL, doesn't matter, as long as I can use it within a scheduled job to run periodically.

Any hint highly appreciated

SFDataEng
Contributor

I think it may be a trick question. There is not necessarily a single, always-available best way, since tables in Unity Catalog can differ in object type, configuration, and load pattern. Which methods are possible, or best, depends on what metadata is (or can be made) available describing the object.

  • If I am wrong, please correct me.
  • Some options are below, numbered in order of preference, with more-preferred methods at the top.
  • Please add on, critique, or suggest better ones. (I probably missed some methods.)


Method 1: Check the _metadata.file_modification_time attribute of the object (if it exists; add it if needed/possible)
Some, but not all, tables/objects have a _metadata column. Skeleton:

SELECT
    'catalog.schema.your_table' AS object_name,
    MAX(_metadata.file_modification_time) AS max_ts
FROM
    catalog.schema.your_table
WHERE
    _metadata.file_modification_time >= CURRENT_TIMESTAMP() - INTERVAL 7 DAYS

 

Method 2: Consume DESCRIBE HISTORY results for the object
Some tables' history can be found by inspecting DESCRIBE HISTORY. (Questions to consider: Is there a slight lag before HISTORY is updated? Do all objects necessarily have HISTORY?) Skeleton:
SELECT
    'catalog.schema.your_table' AS object_name,
    MAX(timestamp) as max_ts
FROM
    (DESCRIBE HISTORY catalog.schema.your_table);

 

Method 3: Check the time of the last successful run of the job that loads that object (if you know which job that is)
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import RunResultState

w = WorkspaceClient()
job_id = 123  # the id of the job that loads the table
# Outline below -- you might have good luck asking an LLM to flesh it out,
# using Databricks ai-dev-tools for reference:
#   1. Based on the job id (or name) that you know, list its completed runs
runs = w.jobs.list_runs(job_id=job_id, completed_only=True)
#   2. Find the run with the latest successful state
last_success_ms = max(r.end_time for r in runs
                      if r.state and r.state.result_state == RunResultState.SUCCESS)
#   3. That's the end time (epoch ms) of the last job run that loaded the table
#   4. The data is no fresher than that, but should be current as of that time


Method 4: Keep your own watermark table
A full solution is omitted for brevity, since building your own metadata table is not really an answer to "where to look it up", but hopefully you get the idea: have the load job write a timestamp when it finishes. This has its own pros and cons, and is probably not what the OP was asking for.
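For completeness, a minimal sketch of the idea (the watermark table and all names in it are made up; the MERGE would run as the last step of each load job):

```sql
-- Hypothetical watermark table, written once per successful load
CREATE TABLE IF NOT EXISTS catalog.schema.load_watermarks (
  object_name    STRING,
  last_loaded_ts TIMESTAMP
);

-- Last step of the load job: upsert this table's watermark
MERGE INTO catalog.schema.load_watermarks t
USING (SELECT 'catalog.schema.your_table' AS object_name,
              CURRENT_TIMESTAMP()         AS last_loaded_ts) s
ON t.object_name = s.object_name
WHEN MATCHED THEN UPDATE SET t.last_loaded_ts = s.last_loaded_ts
WHEN NOT MATCHED THEN INSERT *;
```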

Louis_Frolio
Databricks Employee

Greetings @GeKo 

Good question.

Short answer: treat the Delta transaction log as your source of truth. Every write, delete, or merge on a Unity Catalog table creates a commit with a timestamp and operation type. DESCRIBE HISTORY gives you access to all of it.

The key move is filtering to only data-changing operations so you're not picking up maintenance noise like OPTIMIZE or VACUUM:

SELECT
  max(timestamp) AS last_data_change_ts
FROM (
  DESCRIBE HISTORY catalog_name.schema_name.table_name
)
WHERE operation IN ('WRITE', 'DELETE', 'MERGE', 'UPDATE');

You can broaden or tighten that operation filter depending on what counts as a "data change" in your environment. RESTORE, TRUNCATE, and CREATE TABLE AS SELECT are also valid candidates depending on your use case.
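If the monitoring job is in Python, it can help to keep that choice in one place. A small hypothetical helper (the set below is just the operations named above; adjust to taste):

```python
# Hypothetical: which Delta history operations count as a "data change".
# OPTIMIZE and VACUUM are deliberately excluded as maintenance noise.
DATA_CHANGE_OPS = {
    "WRITE", "DELETE", "MERGE", "UPDATE",
    "RESTORE", "TRUNCATE", "CREATE TABLE AS SELECT",
}

def is_data_change(operation: str) -> bool:
    """True if a DESCRIBE HISTORY operation value modifies table data."""
    return operation.upper() in DATA_CHANGE_OPS
```

You would then filter the DESCRIBE HISTORY rows with this predicate instead of hard-coding the IN list in SQL.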

Python equivalent, ready for a scheduled job:

table_name = "catalog_name.schema_name.table_name"

df = spark.sql(f"""
  SELECT max(timestamp) AS last_data_change_ts
  FROM (
    DESCRIBE HISTORY {table_name}
  )
  WHERE operation IN ('WRITE', 'DELETE', 'MERGE', 'UPDATE')
""")

last_ts = df.collect()[0]["last_data_change_ts"]
print(f"Last data change for {table_name}: {last_ts}")

From there you can write last_ts into a monitoring table, compare it to current time, or trigger an alert if the table hasn't been updated in a while. This only reads the transaction log, not the table data, so it's lightweight.
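The compare-to-current-time step can be a tiny pure-Python check; a sketch (the function name and threshold are made up):

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_change_ts, max_age, now=None):
    """True if the table's last data change is older than max_age.

    last_change_ts: timezone-aware datetime (e.g. the collected last_ts)
    max_age: timedelta freshness threshold
    now: injectable for testing; defaults to current UTC time
    """
    now = now or datetime.now(timezone.utc)
    return (now - last_change_ts) > max_age
```

For example, `is_stale(last_ts, timedelta(hours=24))` could drive an alert from the scheduled job.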

If you don't need to distinguish between operation types and just want a quick timestamp, DESCRIBE DETAIL is a simpler option:

DESCRIBE DETAIL catalog_name.schema_name.table_name

That gives you a lastModified column reflecting the most recent modification. Less granular, but gets the job done for simpler checks.

One gotcha worth flagging: don't rely on the last_altered column in system.information_schema.tables. It only tracks DDL changes like schema modifications and table properties. It does not reflect DML operations (inserts, updates, deletes). If someone writes new rows to your table, last_altered won't move. This bites people regularly.

A couple things to keep in mind:

  • Delta table history is retained for 30 days by default (controlled by delta.logRetentionDuration). For most scheduled monitoring use cases that's plenty, but good to know the boundary.
  • If you ever need row-level change tracking rather than table-level, look into Change Data Feed. Different tool for a different job, but worth knowing it's there.
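To give a flavor of the second point: once Change Data Feed is enabled on a table, row-level changes can be read back with the table_changes function. A sketch (table name and starting version are placeholders):

```sql
-- One-time setup: enable CDF on the table
ALTER TABLE catalog_name.schema_name.table_name
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Row-level changes since table version 1,
-- including _change_type and _commit_timestamp columns
SELECT * FROM table_changes('catalog_name.schema_name.table_name', 1);
```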

Hope that helps — let us know how it goes.

Cheers, Lou


GeKo
Contributor

Many thanks for answering @SFDataEng  and @Louis_Frolio 
I'll try the recommendation from Lou first, then approach methods 3 and 4 from SFDataEng if required 😄