<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Orchestrating Irregular Databricks Jobs from external source Timestamps in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/orchestrating-irregular-databricks-jobs-from-external-source/m-p/149759#M1048</link>
    <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/217611"&gt;@PiotrPustola&lt;/a&gt;, for sharing such an interesting problem!&lt;/P&gt;&lt;P&gt;We currently use file-based and table-based triggers in our production setup, but it’s always good to know about other possibilities and approaches like this.&lt;/P&gt;</description>
    <pubDate>Wed, 04 Mar 2026 09:55:30 GMT</pubDate>
    <dc:creator>Kirankumarbs</dc:creator>
    <dc:date>2026-03-04T09:55:30Z</dc:date>
    <item>
      <title>Orchestrating Irregular Databricks Jobs from external source Timestamps</title>
      <link>https://community.databricks.com/t5/community-articles/orchestrating-irregular-databricks-jobs-from-external-source/m-p/149261#M1039</link>
      <description>&lt;P&gt;&lt;EM&gt;Works for any event-driven workload: IoT alerts, e-commerce flash sales, financial market close processing.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT size="5"&gt;&lt;STRONG&gt;Goal&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;In this project, I needed to start Databricks jobs on an irregular basis, driven entirely by timestamps stored in PostgreSQL rather than by a fixed schedule.&lt;/P&gt;&lt;P&gt;The concrete use case was processing football match data right after the final whistle. Because matches have irregular kick-off times and variable durations, it was not possible to define a simple, fixed schedule for job runs. Instead, the requirement was to trigger a job at `match_start_time + 105 minutes`, which corresponds to the earliest possible end time of a match.&lt;/P&gt;&lt;P&gt;&lt;FONT size="5"&gt;&lt;STRONG&gt;Challenge&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;On a previous project, I had solved a similar problem using Apache Airflow. There, I could rely on a timetable schedule: an external source with a collection of timestamps that defines when a DAG (Airflow job) should be triggered. Airflow would automatically poll that source and trigger the DAG at the appropriate times.&lt;/P&gt;&lt;P&gt;When I moved to Databricks, I could not find a direct equivalent of this timetable-style scheduling. 
Out of the box, Databricks jobs can be triggered in the following ways:&lt;/P&gt;&lt;P&gt;- &lt;A href="https://docs.databricks.com/aws/en/jobs/scheduled" target="_blank" rel="noopener"&gt;Using a CRON schedule&lt;/A&gt;&lt;BR /&gt;- &lt;A href="https://docs.databricks.com/aws/en/jobs/file-arrival-triggers" target="_blank" rel="noopener"&gt;On file arrival&lt;/A&gt;&lt;BR /&gt;- &lt;A href="https://docs.databricks.com/aws/en/jobs/trigger-table-update" target="_blank" rel="noopener"&gt;On table update&lt;/A&gt; (not yet available at the time of this implementation)&lt;BR /&gt;- &lt;A href="https://docs.databricks.com/api/workspace/jobs/runnow" target="_blank" rel="noopener"&gt;Via REST API&lt;/A&gt;&lt;/P&gt;&lt;P&gt;None of these options provided the same "pull timestamps from an external list and schedule dynamically" behavior that I previously had with Airflow timetables.&lt;/P&gt;&lt;P&gt;&lt;FONT size="5"&gt;&lt;STRONG&gt;Idea: A Self-Rescheduling Orchestrator Job&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;To bridge this gap, I designed an orchestrator job responsible for triggering ETL jobs at the specific timestamps stored in PostgreSQL.&lt;/P&gt;&lt;P&gt;The high-level approach was:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;When the orchestrator job runs, it queries PostgreSQL for match data based on the job execution timestamp&lt;/LI&gt;&lt;LI&gt;It selects all matches that start around "now" plus a time buffer (for example, all matches that start at the current time plus some offset)&lt;/LI&gt;&lt;LI&gt;For each such game, the orchestrator triggers the appropriate ETL jobs&lt;/LI&gt;&lt;LI&gt;After triggering those ETL jobs, the orchestrator reschedules **itself** to run at the next relevant timestamp taken from PostgreSQL&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;What happens if there are no upcoming matches in PostgreSQL?&lt;/STRONG&gt;&lt;/U&gt;&lt;BR /&gt;There is one more component running at fixed schedules: a job that fetches new match datetimes 
and populates PostgreSQL. It is also responsible for checking whether the orchestrator is scheduled; if not, it schedules it for the earliest upcoming match.&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;Why not run-now?&lt;/STRONG&gt;&lt;/U&gt; A natural alternative might be using the REST API run-now endpoint in a polling loop from an external scheduler. However, cron updates are preferable here because they enable precise, native scheduling without external timers or continuous polling.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Second-level precision&lt;/STRONG&gt;: Cron updates let Databricks handle exact timing using its managed scheduler—no drift from external cron jobs or polling intervals&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Zero idle compute&lt;/STRONG&gt;: Unlike polling every X minutes, the job sleeps until the exact timestamp, minimizing costs&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Fault tolerance&lt;/STRONG&gt;: Databricks retries failed schedules automatically; external polling would need custom retry logic&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Simplicity:&lt;/STRONG&gt;&amp;nbsp;One job self-manages its lifecycle vs. maintaining separate poller + trigger infrastructure.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;This keeps everything within Databricks while achieving Airflow timetable-like behavior.&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;Why not use newer Databricks features&lt;/STRONG&gt;&lt;/U&gt; like table triggers or &lt;A href="https://www.databricks.com/product/data-engineering/spark-declarative-pipelines" target="_blank" rel="noopener"&gt;Spark Declarative Pipelines&lt;/A&gt;?&lt;BR /&gt;By the time of implementation, table update triggers weren't available. Now they enable event-driven job execution when Delta/Iceberg tables are updated (merge/delete), with dynamic parameters like commit timestamp. 
However, they require mirroring PostgreSQL data into Delta tables first, adding sync overhead and latency—unsuitable for direct timestamp-driven scheduling from an external relational database.&lt;/P&gt;&lt;P&gt;Spark Declarative Pipelines (part of Lakeflow) excel at no-code ETL orchestration with automated lineage and retries, but their scheduling still relies on CRON, file arrival, or table triggers—not dynamic querying of external PostgreSQL timestamps to compute precise execution windows.&lt;/P&gt;&lt;P&gt;This self-rescheduling is the key part of the solution. It turns the job into a precise, event-driven loop that reprograms its next execution time according to the latest information in PostgreSQL—without requiring continuous running and polling of the database, which would incur unnecessary costs and wasted compute time, unlike a static CRON expression defined once.&lt;/P&gt;&lt;P&gt;&lt;FONT size="5"&gt;&lt;STRONG&gt;Implementation: Updating the Job Schedule via REST API&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;Rescheduling the job is done through the Databricks &lt;A href="https://docs.databricks.com/api/workspace/introduction" target="_blank" rel="noopener"&gt;REST API&lt;/A&gt;, using the `&lt;A href="https://docs.databricks.com/api/workspace/jobs/update" target="_blank" rel="noopener"&gt;/api/2.2/jobs/update&lt;/A&gt;`&amp;nbsp;endpoint and providing a new schedule in the `new_settings` object.&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;Converting Python datetime to Quartz Cron format&lt;/STRONG&gt;&lt;/U&gt;&lt;BR /&gt;Databricks uses Quartz Cron expressions for schedules, which follow the format: `[second] [minute] [hour] [day of month] [month] [day of week] [year]`. I created a helper function to convert a Python `datetime.datetime` object to this format:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;def datetime_to_quartz_cron(dt: datetime) -&amp;gt; str:
    """Convert datetime to Quartz cron expression.
    Quartz cron format: [second] [minute] [hour] [day of month] [month] [day of week] [year]
    http://www.quartz-scheduler.org/documentation/quartz-2.3.0/tutorials/crontrigger.html

    Parameters
    ----------
    dt : datetime.datetime
        Datetime to convert

    Returns
    -------
    str
        Quartz cron expression
    """
    return f"{dt.second} {dt.minute} {dt.hour} {dt.day} {dt.month} ? {dt.year}"&lt;/LI-CODE&gt;&lt;P&gt;The core Python snippet that performs the reschedule looks like this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import CronSchedule, JobSettings

new_cron_schedule = CronSchedule(
    quartz_cron_expression=new_schedule_date_quartz_cron_format,
    timezone_id="UTC",
)
new_settings = JobSettings(schedule=new_cron_schedule)

workspace_client = WorkspaceClient(**config)
workspace_client.jobs.update(job_id=job_id, new_settings=new_settings)&lt;/LI-CODE&gt;&lt;P&gt;In this pattern:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The orchestrator computes `new_schedule_date_quartz_cron_format` (using the helper function above) based on the next batch of matches stored in PostgreSQL&lt;/LI&gt;&lt;LI&gt;It then constructs a new `CronSchedule` and wraps it into `JobSettings`&lt;/LI&gt;&lt;LI&gt;Finally, it calls `jobs.update` for its own `job_id`, effectively changing its next trigger time to align with the next match window&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;This provides a flexible, data-driven schedule: as soon as match data changes in PostgreSQL, the orchestrator will adapt its future runs accordingly, without manual intervention or static CRON maintenance.&lt;/P&gt;</description>
      <pubDate>Wed, 25 Feb 2026 08:34:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/orchestrating-irregular-databricks-jobs-from-external-source/m-p/149261#M1039</guid>
      <dc:creator>PiotrPustola</dc:creator>
      <dc:date>2026-02-25T08:34:53Z</dc:date>
    </item>
    <item>
      <title>Re: Orchestrating Irregular Databricks Jobs from external source Timestamps</title>
      <link>https://community.databricks.com/t5/community-articles/orchestrating-irregular-databricks-jobs-from-external-source/m-p/149759#M1048</link>
      <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/217611"&gt;@PiotrPustola&lt;/a&gt;, for sharing such an interesting problem!&lt;/P&gt;&lt;P&gt;We currently use file-based and table-based triggers in our production setup, but it’s always good to know about other possibilities and approaches like this.&lt;/P&gt;</description>
      <pubDate>Wed, 04 Mar 2026 09:55:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/orchestrating-irregular-databricks-jobs-from-external-source/m-p/149759#M1048</guid>
      <dc:creator>Kirankumarbs</dc:creator>
      <dc:date>2026-03-04T09:55:30Z</dc:date>
    </item>
    <item>
      <title>Re: Orchestrating Irregular Databricks Jobs from external source Timestamps</title>
      <link>https://community.databricks.com/t5/community-articles/orchestrating-irregular-databricks-jobs-from-external-source/m-p/150150#M1060</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/217611"&gt;@PiotrPustola&lt;/a&gt; -- The self-rescheduling orchestrator pattern is a really elegant solution for event-driven workloads that depend on externally managed timestamps. A few thoughts and additions that might help you and others who land on this article:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Additional tips for the self-rescheduling pattern&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;1. Pause status awareness: When you update the schedule via the Jobs API, make sure to explicitly set pause_status to "UNPAUSED" in the CronSchedule object. If a job was paused for any reason (manual intervention, maintenance, etc.), just updating the quartz_cron_expression alone will not unpause it. Here is an example:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from databricks.sdk.service.jobs import CronSchedule, PauseStatus

new_cron_schedule = CronSchedule(
    quartz_cron_expression=new_schedule_date_quartz_cron_format,
    timezone_id="UTC",
    pause_status=PauseStatus.UNPAUSED,
)&lt;/LI-CODE&gt;
&lt;P&gt;2. Idempotency and race conditions: If the orchestrator fails partway through (after triggering ETL jobs but before rescheduling itself), you could end up in a state where it never runs again. A good safety net is a separate lightweight "watchdog" job on a fixed daily CRON that checks whether the orchestrator has a valid future schedule, and reschedules it if not. It sounds like your match-fetching job already handles this, which is a solid design.&lt;/P&gt;
&lt;P&gt;3. Scheduler latency: The Databricks documentation notes that the job scheduler is not designed for sub-second precision and may experience delays of up to several minutes due to infrastructure conditions. For your use case (triggering 105 minutes after kickoff) this is fine, but anyone adapting this pattern for tighter timing windows should be aware of this.&lt;/P&gt;
&lt;P&gt;4. Using the Reset vs Update endpoint: You correctly use the Update endpoint (POST /api/2.2/jobs/update), which only modifies the fields you specify. Be careful not to accidentally use the Reset endpoint (POST /api/2.2/jobs/reset), which overwrites the entire job configuration and could wipe out task definitions, cluster settings, and other configuration if you only pass in the schedule.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Table update triggers as a complement&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;You mentioned that table update triggers were not available at the time of your implementation. Now that they are GA, they could work well as a complementary pattern for some variations of this problem. Specifically, if you already have a job writing match schedule data into a Delta table (rather than only PostgreSQL), you could use a table update trigger to kick off downstream ETL whenever new match records land.&lt;/P&gt;
&lt;P&gt;Table update triggers support dynamic parameters like commit timestamps and updated table lists, which can be useful for filtering logic in downstream tasks. Key limitations to be aware of: maximum 10 tables per trigger, and for best performance you should enable file events on the external storage location.&lt;/P&gt;
&lt;P&gt;That said, for your exact use case -- scheduling at a computed future time (kickoff + 105 min) rather than reacting immediately to a data change -- the self-rescheduling approach is still the right tool. Table triggers fire on data arrival, not at a computed offset from a timestamp value in the data.&lt;/P&gt;
&lt;P&gt;Documentation: &lt;A href="https://docs.databricks.com/aws/en/jobs/trigger-table-update" target="_blank"&gt;https://docs.databricks.com/aws/en/jobs/trigger-table-update&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Webhook and API-based alternatives&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;For anyone reading this who does have an external system that can push events (rather than store timestamps for polling), the Jobs API run-now endpoint (POST /api/2.2/jobs/run-now) is also worth considering. You can trigger a job run immediately and pass parameters:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://docs.databricks.com/api/workspace/jobs/runnow" target="_blank"&gt;https://docs.databricks.com/api/workspace/jobs/runnow&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;This works well when paired with cloud-native event systems (AWS EventBridge, Azure Event Grid, GCP Pub/Sub) that can call webhooks or Lambda functions to trigger jobs via the API. But as you noted, this requires external infrastructure to manage the timing, which is exactly what your pattern avoids.&lt;/P&gt;
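&lt;P&gt;For reference, a run-now call needs nothing beyond the standard library. This sketch only builds the request without sending it; the host, token, and the `match_id` parameter name are placeholders, and the target job would need a matching job parameter defined.&lt;/P&gt;

```python
import json
from urllib import request


def build_run_now_request(host: str, token: str, job_id: int,
                          job_parameters: dict) -> request.Request:
    """Build (without sending) a POST to the Jobs run-now endpoint.
    `host` and `token` stand in for a real workspace URL and access token."""
    body = json.dumps({"job_id": job_id, "job_parameters": job_parameters})
    return request.Request(
        url=f"{host}/api/2.2/jobs/run-now",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# urllib.request.urlopen(req) would actually start the run; omitted here.
```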
&lt;P&gt;&lt;STRONG&gt;Relevant documentation&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;- Scheduled triggers (CRON): &lt;A href="https://docs.databricks.com/aws/en/jobs/scheduled" target="_blank"&gt;https://docs.databricks.com/aws/en/jobs/scheduled&lt;/A&gt;&lt;BR /&gt;- Table update triggers: &lt;A href="https://docs.databricks.com/aws/en/jobs/trigger-table-update" target="_blank"&gt;https://docs.databricks.com/aws/en/jobs/trigger-table-update&lt;/A&gt;&lt;BR /&gt;- File arrival triggers: &lt;A href="https://docs.databricks.com/aws/en/jobs/file-arrival-triggers" target="_blank"&gt;https://docs.databricks.com/aws/en/jobs/file-arrival-triggers&lt;/A&gt;&lt;BR /&gt;- Jobs API - Update endpoint: &lt;A href="https://docs.databricks.com/api/workspace/jobs/update" target="_blank"&gt;https://docs.databricks.com/api/workspace/jobs/update&lt;/A&gt;&lt;BR /&gt;- Jobs API - Run Now endpoint: &lt;A href="https://docs.databricks.com/api/workspace/jobs/runnow" target="_blank"&gt;https://docs.databricks.com/api/workspace/jobs/runnow&lt;/A&gt;&lt;BR /&gt;- Databricks SDK for Python: &lt;A href="https://docs.databricks.com/aws/en/dev-tools/sdk-python" target="_blank"&gt;https://docs.databricks.com/aws/en/dev-tools/sdk-python&lt;/A&gt;&lt;BR /&gt;- Quartz Cron syntax reference: &lt;A href="http://www.quartz-scheduler.org/documentation/quartz-2.3.0/tutorials/crontrigger.html" target="_blank"&gt;http://www.quartz-scheduler.org/documentation/quartz-2.3.0/tutorials/crontrigger.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Thanks for sharing this -- the pattern generalizes nicely beyond sports data to any domain with externally defined, irregular event timestamps (IoT maintenance windows, financial market closes, logistics ETAs, etc.).&lt;/P&gt;
&lt;P&gt;* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.&lt;/P&gt;
      <pubDate>Sun, 08 Mar 2026 05:01:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/orchestrating-irregular-databricks-jobs-from-external-source/m-p/150150#M1060</guid>
      <dc:creator>SteveOstrowski</dc:creator>
      <dc:date>2026-03-08T05:01:19Z</dc:date>
    </item>
  </channel>
</rss>

