Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Scheduling jobs with table update triggers

Garybary
New Contributor III

Hi all,

Lately I've been experimenting with the newish feature of scheduling jobs on a table update trigger. There's one thing that's blocking me from implementing it, however, and I was hoping someone has found a solution to it.

We occasionally perform a VACUUM operation in our PRD environment. This is not synchronized across the platform, so some of my DPs receive an update while others don't. When these are then updated by the daily load the next day, the job often starts running with only half of its tables actually updated.

Do you know of a way to filter on specific table update trigger types (e.g. to filter out specific operations), or other ways to make this more stable?

Thanks for reading! 

3 REPLIES

saurabh18cs
Honored Contributor III

Hi @Garybary I am not sure whether this trigger supports filtering by operation, but I can suggest another approach: every time the job gets triggered, add an extra check in the first cell of the pipeline notebook or code:

tbl = "catalog.schema.table_name"

# DESCRIBE HISTORY returns the newest commit first
last = spark.sql(f"DESCRIBE HISTORY {tbl} LIMIT 1").collect()[0]
op = (last["operation"] or "").upper()

# VACUUM is logged as two commits ("VACUUM START" / "VACUUM END"),
# and Z-ordering is logged under the "OPTIMIZE" operation
if op.startswith("VACUUM") or op == "OPTIMIZE":
    dbutils.notebook.exit(f"SKIP: maintenance operation detected: {op}")

 

Garybary
New Contributor III

Hi @saurabh18cs, thanks for thinking along! I think your method works well, but I feel it has some downsides; for example, it still launches the required compute before being told to shut down again. I read another post that uses the time window setting on the job trigger, which might also provide a solution: by setting the window to roughly 23 hours we could stop unintended updates from triggering a run. It's a bit of a brute-force method, though :).

SteveOstrowski
Databricks Employee

Hi @Garybary,

This is a common scenario when using table update triggers. Currently, table update triggers do not support filtering by operation type. The trigger fires on any commit to the Delta transaction log, and VACUUM does write commit entries to the log (VACUUM START and VACUUM END, which you can verify with DESCRIBE HISTORY on the table). So a VACUUM on any of your monitored tables will look like a "table update" to the trigger, even though no actual data changed.

Here are a few approaches to work around this:

OPTION 1: ADD A GUARD CHECK AT THE START OF YOUR JOB

Use the dynamic job parameter that provides the list of updated tables along with commit metadata. In the first task of your job, query DESCRIBE HISTORY on each triggered table and check whether the most recent commit was a data-changing operation or just a maintenance operation (VACUUM, OPTIMIZE, etc.).

Delta history includes a column called "operation" which tells you exactly what type of commit it was (WRITE, MERGE, DELETE, VACUUM, OPTIMIZE, etc.). You can use this to short-circuit the job early if the trigger was caused only by maintenance operations.

Example Python task at the start of your job:

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Get the list of updated tables from the trigger parameter
updated_tables_json = dbutils.widgets.get("updated_tables")
updated_tables = json.loads(updated_tables_json)

# Check if any table had a real data change (not just maintenance).
# Note: VACUUM appears in history as "VACUUM START" / "VACUUM END".
maintenance_ops = {"OPTIMIZE", "FSCK", "REORG"}
has_data_change = False

for table_name in updated_tables:
    history = spark.sql(f"DESCRIBE HISTORY {table_name} LIMIT 1").collect()
    if history:
        last_op = history[0]["operation"]
        if last_op not in maintenance_ops and not last_op.startswith("VACUUM"):
            has_data_change = True
            break

if not has_data_change:
    dbutils.notebook.exit("SKIP: triggered by maintenance operation only")

Then configure the downstream tasks to use "Run if dependencies" conditions so they only run when the guard task succeeds with a data-change result.
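One way to wire that up (a sketch only; the task keys, notebook paths, and the task value name here are hypothetical) is to have the guard task publish its result with dbutils.jobs.taskValues.set("has_data_change", ...) instead of exiting, and then gate the main task behind a condition task in the job spec:

```json
{
  "tasks": [
    {
      "task_key": "guard",
      "notebook_task": {"notebook_path": "/Jobs/guard_check"}
    },
    {
      "task_key": "check_data_change",
      "depends_on": [{"task_key": "guard"}],
      "condition_task": {
        "op": "EQUAL_TO",
        "left": "{{tasks.guard.values.has_data_change}}",
        "right": "true"
      }
    },
    {
      "task_key": "main_pipeline",
      "depends_on": [{"task_key": "check_data_change", "outcome": "true"}],
      "notebook_task": {"notebook_path": "/Jobs/main_pipeline"}
    }
  ]
}
```

With this shape, a maintenance-only trigger still starts the guard task's compute (as @Garybary noted), but the heavier main_pipeline task never runs.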

When setting up this job, pass the dynamic parameter to the first task by adding a job parameter:

Key: updated_tables
Value: {{job.trigger.table_update.updated_tables}}

OPTION 2: USE THE "ALL TABLES UPDATED" CONDITION

If your job depends on multiple source tables all being refreshed, set the trigger condition to "All tables updated" rather than "Any table updated." This way, a VACUUM on just one or two tables will not cause the job to run. The trigger will only fire once all monitored tables have been updated since the last run. This does not completely solve the problem (if all tables get vacuumed, it would still fire), but it significantly reduces false triggers in environments where VACUUM runs are staggered.
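The effect of the condition can be illustrated with a toy model (this is just set logic for intuition, not the actual trigger implementation):

```python
# Toy model of the "All tables updated" condition: the job fires only
# once every monitored table has seen a commit since the last run.
monitored = {"sales", "customers", "orders"}

def should_fire(updated_since_last_run: set) -> bool:
    """Return True when all monitored tables have been updated."""
    return monitored.issubset(updated_since_last_run)

# A staggered VACUUM touches only one table -> no run
print(should_fire({"sales"}))                              # False
# The daily load commits to all three tables -> the trigger fires
print(should_fire({"sales", "customers", "orders"}))       # True
```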

OPTION 3: COORDINATE YOUR VACUUM SCHEDULE

Consider scheduling VACUUM operations during a known maintenance window and combine them with the "Minimum time between triggers" setting. For example, if your daily data loads happen at 8:00 AM and your VACUUM jobs run at 2:00 AM, you could set the minimum time between triggers to avoid responding to changes during the maintenance window. This is not a filter, but it can help batch the trigger so that a VACUUM at 2:00 AM and a data load at 8:00 AM do not cause two separate runs.
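The arithmetic behind that batching can be sketched as follows (the times are illustrative assumptions, not anything Databricks computes for you):

```python
from datetime import datetime, timedelta

# Assumed schedule: VACUUM commits at 02:00, the daily load at 08:00
vacuum_commit = datetime(2024, 1, 15, 2, 0)
load_commit = datetime(2024, 1, 15, 8, 0)

# If "Minimum time between triggers" covers the gap between the two
# commits, they fall into one trigger evaluation instead of two runs.
min_between_triggers = timedelta(hours=6)
gap = load_commit - vacuum_commit
print(gap <= min_between_triggers)  # True: one batched run
```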

OPTION 4: USE A WRAPPER JOB PATTERN

Instead of triggering your main pipeline directly, create a lightweight "dispatcher" job that fires on table updates. The dispatcher checks whether the updates are data changes (using the DESCRIBE HISTORY approach from Option 1), and if so, programmatically triggers your main pipeline job via the Databricks Jobs API.

import requests

# workspace_url, token and YOUR_MAIN_JOB_ID are placeholders for your
# workspace hostname, an access token (ideally read from a secret
# scope, not hardcoded) and the target job's ID.
# After confirming a real data change occurred...
response = requests.post(
    f"https://{workspace_url}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": YOUR_MAIN_JOB_ID},
)
response.raise_for_status()

RELEVANT DOCUMENTATION

- Table update triggers: https://docs.databricks.com/aws/en/jobs/trigger-table-update
- Job triggers overview: https://docs.databricks.com/aws/en/jobs/triggers
- Delta table history: https://docs.databricks.com/aws/en/delta/history
- VACUUM documentation: https://docs.databricks.com/aws/en/delta/vacuum
- Dynamic job parameters for triggers: https://docs.databricks.com/aws/en/jobs/trigger-table-update (see "Reference updated tables and commit timestamps in job configurations")

The ability to filter triggers by operation type would be a useful enhancement. If this is important to your workflow, I would encourage submitting a feature request through the Databricks Ideas portal, as community votes help prioritize the roadmap.

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.