<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Orchestrating Irregular Databricks Jobs from external source Timestamps in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/orchestrating-irregular-databricks-jobs-from-external-source/m-p/149759#M1048</link>
    <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/217611"&gt;@PiotrPustola&lt;/a&gt;, for sharing such an interesting problem!&lt;/P&gt;&lt;P&gt;We currently use file-based and table-based triggers in our production setup, but it’s always good to know about other possibilities and approaches like this.&lt;/P&gt;</description>
    <pubDate>Wed, 04 Mar 2026 09:55:30 GMT</pubDate>
    <dc:creator>Kirankumarbs</dc:creator>
    <dc:date>2026-03-04T09:55:30Z</dc:date>
    <item>
      <title>Orchestrating Irregular Databricks Jobs from external source Timestamps</title>
      <link>https://community.databricks.com/t5/community-articles/orchestrating-irregular-databricks-jobs-from-external-source/m-p/149261#M1039</link>
      <description>&lt;P&gt;&lt;EM&gt;Works for any event-driven workload: IoT alerts, e-commerce flash sales, financial market close processing.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT size="5"&gt;&lt;STRONG&gt;Goal&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;In this project, I needed to start Databricks jobs on an irregular basis, driven entirely by timestamps stored in PostgreSQL rather than by a fixed schedule.&lt;/P&gt;&lt;P&gt;The concrete use case was processing football match data right after the final whistle. Because matches have irregular kick-off times and variable durations, it was not possible to define a simple, fixed schedule for job runs. Instead, the requirement was to trigger a job at `match_start_time + 105 minutes`, which corresponds to the earliest possible end time of a match.&lt;/P&gt;&lt;P&gt;&lt;FONT size="5"&gt;&lt;STRONG&gt;Challenge&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;On a previous project, I had solved a similar problem using Apache Airflow. There, I could rely on a timetable schedule: an external source with a collection of timestamps that defines when a DAG (Airflow job) should be triggered. Airflow would automatically poll that source and trigger the DAG at the appropriate times.&lt;/P&gt;&lt;P&gt;When I moved to Databricks, I could not find a direct equivalent of this timetable-style scheduling. 
Out of the box, Databricks jobs can be triggered in the following ways:&lt;/P&gt;&lt;P&gt;- &lt;A href="https://docs.databricks.com/aws/en/jobs/scheduled" target="_blank" rel="noopener"&gt;Using a CRON schedule&lt;/A&gt;&lt;BR /&gt;- &lt;A href="https://docs.databricks.com/aws/en/jobs/file-arrival-triggers" target="_blank" rel="noopener"&gt;On file arrival&lt;/A&gt;&lt;BR /&gt;- &lt;A href="https://docs.databricks.com/aws/en/jobs/trigger-table-update" target="_blank" rel="noopener"&gt;On table update&lt;/A&gt; (not yet available at the time of this implementation)&lt;BR /&gt;- &lt;A href="https://docs.databricks.com/api/workspace/jobs/runnow" target="_blank" rel="noopener"&gt;Via REST API&lt;/A&gt;&lt;/P&gt;&lt;P&gt;None of these options provided the same "pull timestamps from an external list and schedule dynamically" behavior that I previously had with Airflow timetables.&lt;/P&gt;&lt;P&gt;&lt;FONT size="5"&gt;&lt;STRONG&gt;Idea: A Self-Rescheduling Orchestrator Job&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;To bridge this gap, I designed an orchestrator job responsible for triggering ETL jobs at the specific timestamps stored in PostgreSQL.&lt;/P&gt;&lt;P&gt;The high-level approach was:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;When the orchestrator job runs, it queries PostgreSQL for match data based on the job execution timestamp&lt;/LI&gt;&lt;LI&gt;It selects all matches that start around "now" plus a time buffer (for example, all matches that start at the current time plus some offset)&lt;/LI&gt;&lt;LI&gt;For each such game, the orchestrator triggers the appropriate ETL jobs&lt;/LI&gt;&lt;LI&gt;After triggering those ETL jobs, the orchestrator reschedules **itself** to run at the next relevant timestamp taken from PostgreSQL&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;What happens if there are no upcoming matches in PostgreSQL?&lt;/STRONG&gt;&lt;/U&gt;&lt;BR /&gt;There is one more component running at fixed schedules: a job that fetches new match datetimes 
and populates PostgreSQL. It is also responsible for checking whether the orchestrator is scheduled; if not, it schedules it for the earliest upcoming match.&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;Why not run-now?&lt;/STRONG&gt;&lt;/U&gt; A natural alternative might be using the REST API run-now endpoint in a polling loop from an external scheduler. However, cron updates are preferable here because they enable precise, native scheduling without external timers or continuous polling.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Second-level precision&lt;/STRONG&gt;: Cron updates let Databricks handle exact timing using its managed scheduler—no drift from external cron jobs or polling intervals&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Zero idle compute&lt;/STRONG&gt;: Unlike polling every X minutes, the job sleeps until the exact timestamp, minimizing costs&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Fault tolerance&lt;/STRONG&gt;: Databricks retries failed schedules automatically; external polling would need custom retry logic&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Simplicity:&lt;/STRONG&gt;&amp;nbsp;One job self-manages its lifecycle vs. maintaining separate poller + trigger infrastructure.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;This keeps everything within Databricks while achieving Airflow timetable-like behavior.&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;Why not use newer Databricks features&lt;/STRONG&gt;&lt;/U&gt; like table triggers or &lt;A href="https://www.databricks.com/product/data-engineering/spark-declarative-pipelines" target="_blank" rel="noopener"&gt;Spark Declarative Pipelines&lt;/A&gt;?&lt;BR /&gt;By the time of implementation, table update triggers weren't available. Now they enable event-driven job execution when Delta/Iceberg tables are updated (merge/delete), with dynamic parameters like commit timestamp. 
However, they require mirroring PostgreSQL data into Delta tables first, adding sync overhead and latency—unsuitable for direct timestamp-driven scheduling from an external relational database.&lt;/P&gt;&lt;P&gt;Spark Declarative Pipelines (part of Lakeflow) excel at no-code ETL orchestration with automated lineage and retries, but their scheduling still relies on CRON, file arrival, or table triggers—not dynamic querying of external PostgreSQL timestamps to compute precise execution windows.&lt;/P&gt;&lt;P&gt;This self-rescheduling is the key part of the solution. It turns the job into a precise, event-driven loop that reprograms its next execution time according to the latest information in PostgreSQL—without requiring continuous running and polling of the database, which would incur unnecessary costs and wasted compute time, unlike a static CRON expression defined once.&lt;/P&gt;&lt;P&gt;&lt;FONT size="5"&gt;&lt;STRONG&gt;Implementation: Updating the Job Schedule via REST API&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;Rescheduling the job is done through the Databricks &lt;A href="https://docs.databricks.com/api/workspace/introduction" target="_blank" rel="noopener"&gt;REST API&lt;/A&gt;, using the `&lt;A href="https://docs.databricks.com/api/workspace/jobs/update" target="_blank" rel="noopener"&gt;/api/2.2/jobs/update&lt;/A&gt;`&amp;nbsp;endpoint and providing a new schedule in the `new_settings` object.&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;Converting Python datetime to Quartz Cron format&lt;/STRONG&gt;&lt;/U&gt;&lt;BR /&gt;Databricks uses Quartz Cron expressions for schedules, which follow the format: `[second] [minute] [hour] [day of month] [month] [day of week] [year]`. I created a helper function to convert a Python `datetime.datetime` object to this format:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;def datetime_to_quartz_cron(dt: datetime) -&amp;gt; str:
    """Convert datetime to Quartz cron expression.
    Quartz cron format: [second] [minute] [hour] [day of month] [month] [day of week] [year]
    http://www.quartz-scheduler.org/documentation/quartz-2.3.0/tutorials/crontrigger.html

    Parameters
    ----------
    dt : datetime.datetime
        Datetime to convert

    Returns
    -------
    str
        Quartz cron expression
    """
    return f"{dt.second} {dt.minute} {dt.hour} {dt.day} {dt.month} ? {dt.year}"&lt;/LI-CODE&gt;&lt;P&gt;The core Python snippet that performs the reschedule looks like this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import CronSchedule, JobSettings

new_cron_schedule = CronSchedule(
    quartz_cron_expression=new_schedule_date_quartz_cron_format,
    timezone_id="UTC",
)
new_settings = JobSettings(schedule=new_cron_schedule)

workspace_client = WorkspaceClient(**config)
workspace_client.jobs.update(job_id=job_id, new_settings=new_settings)&lt;/LI-CODE&gt;&lt;P&gt;In this pattern:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The orchestrator computes `new_schedule_date_quartz_cron_format` (using the helper function above) based on the next batch of matches stored in PostgreSQL&lt;/LI&gt;&lt;LI&gt;It then constructs a new `CronSchedule` and wraps it into `JobSettings`&lt;/LI&gt;&lt;LI&gt;Finally, it calls `jobs.update` for its own `job_id`, effectively changing its next trigger time to align with the next match window&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;This provides a flexible, data-driven schedule: as soon as match data changes in PostgreSQL, the orchestrator will adapt its future runs accordingly, without manual intervention or static CRON maintenance.&lt;/P&gt;</description>
      <pubDate>Wed, 25 Feb 2026 08:34:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/orchestrating-irregular-databricks-jobs-from-external-source/m-p/149261#M1039</guid>
      <dc:creator>PiotrPustola</dc:creator>
      <dc:date>2026-02-25T08:34:53Z</dc:date>
    </item>
    <item>
      <title>Re: Orchestrating Irregular Databricks Jobs from external source Timestamps</title>
      <link>https://community.databricks.com/t5/community-articles/orchestrating-irregular-databricks-jobs-from-external-source/m-p/149759#M1048</link>
      <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/217611"&gt;@PiotrPustola&lt;/a&gt;, for sharing such an interesting problem!&lt;/P&gt;&lt;P&gt;We currently use file-based and table-based triggers in our production setup, but it’s always good to know about other possibilities and approaches like this.&lt;/P&gt;</description>
      <pubDate>Wed, 04 Mar 2026 09:55:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/orchestrating-irregular-databricks-jobs-from-external-source/m-p/149759#M1048</guid>
      <dc:creator>Kirankumarbs</dc:creator>
      <dc:date>2026-03-04T09:55:30Z</dc:date>
    </item>
    <item>
      <title>Re: Orchestrating Irregular Databricks Jobs from external source Timestamps</title>
      <link>https://community.databricks.com/t5/community-articles/orchestrating-irregular-databricks-jobs-from-external-source/m-p/150150#M1060</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/217611"&gt;@PiotrPustola&lt;/a&gt; -- The self-rescheduling orchestrator pattern is a really elegant solution for event-driven workloads that depend on externally managed timestamps. A few thoughts and additions that might help you and others who land on this article:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Additional tips for the self-rescheduling pattern&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;1. Pause status awareness: When you update the schedule via the Jobs API, make sure to explicitly set pause_status to "UNPAUSED" in the CronSchedule object. If a job was paused for any reason (manual intervention, maintenance, etc.), just updating the quartz_cron_expression alone will not unpause it. Here is an example:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from databricks.sdk.service.jobs import CronSchedule, PauseStatus

new_cron_schedule = CronSchedule(
    quartz_cron_expression=new_schedule_date_quartz_cron_format,
    timezone_id="UTC",
    pause_status=PauseStatus.UNPAUSED,
)&lt;/LI-CODE&gt;
&lt;P&gt;2. Idempotency and race conditions: If the orchestrator fails partway through (after triggering ETL jobs but before rescheduling itself), you could end up in a state where it never runs again. A good safety net is a separate lightweight "watchdog" job on a fixed daily CRON that checks whether the orchestrator has a valid future schedule, and reschedules it if not. It sounds like your match-fetching job already handles this, which is a solid design.&lt;/P&gt;
&lt;P&gt;3. Scheduler latency: The Databricks documentation notes that the job scheduler is not designed for sub-second precision and may experience delays of up to several minutes due to infrastructure conditions. For your use case (triggering 105 minutes after kickoff) this is fine, but anyone adapting this pattern for tighter timing windows should be aware of this.&lt;/P&gt;
&lt;P&gt;4. Using the Reset vs Update endpoint: You correctly use the Update endpoint (POST /api/2.2/jobs/update), which only modifies the fields you specify. Be careful not to accidentally use the Reset endpoint (POST /api/2.2/jobs/reset), which overwrites the entire job configuration and could wipe out task definitions, cluster settings, and other configuration if you only pass in the schedule.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Table update triggers as a complement&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;You mentioned that table update triggers were not available at the time of your implementation. Now that they are GA, they could work well as a complementary pattern for some variations of this problem. Specifically, if you already have a job writing match schedule data into a Delta table (rather than only PostgreSQL), you could use a table update trigger to kick off downstream ETL whenever new match records land.&lt;/P&gt;
&lt;P&gt;Table update triggers support dynamic parameters like commit timestamps and updated table lists, which can be useful for filtering logic in downstream tasks. Key limitations to be aware of: maximum 10 tables per trigger, and for best performance you should enable file events on the external storage location.&lt;/P&gt;
&lt;P&gt;That said, for your exact use case -- scheduling at a computed future time (kickoff + 105 min) rather than reacting immediately to a data change -- the self-rescheduling approach is still the right tool. Table triggers fire on data arrival, not at a computed offset from a timestamp value in the data.&lt;/P&gt;
&lt;P&gt;Documentation: &lt;A href="https://docs.databricks.com/aws/en/jobs/trigger-table-update" target="_blank"&gt;https://docs.databricks.com/aws/en/jobs/trigger-table-update&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Webhook and API-based alternatives&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;For anyone reading this who does have an external system that can push events (rather than store timestamps for polling), the Jobs API run-now endpoint (POST /api/2.2/jobs/run-now) is also worth considering. You can trigger a job run immediately and pass parameters:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://docs.databricks.com/api/workspace/jobs/runnow" target="_blank"&gt;https://docs.databricks.com/api/workspace/jobs/runnow&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;This works well when paired with cloud-native event systems (AWS EventBridge, Azure Event Grid, GCP Pub/Sub) that can call webhooks or Lambda functions to trigger jobs via the API. But as you noted, this requires external infrastructure to manage the timing, which is exactly what your pattern avoids.&lt;/P&gt;
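&lt;P&gt;For reference, a run-now call needs nothing beyond the standard library. This sketch only builds the request without sending it; the host, token, and the `match_id` parameter name are placeholders, and the target job would need a matching job parameter defined.&lt;/P&gt;

```python
import json
from urllib import request


def build_run_now_request(host: str, token: str, job_id: int,
                          job_parameters: dict) -> request.Request:
    """Build (without sending) a POST to the Jobs run-now endpoint.
    `host` and `token` stand in for a real workspace URL and access token."""
    body = json.dumps({"job_id": job_id, "job_parameters": job_parameters})
    return request.Request(
        url=f"{host}/api/2.2/jobs/run-now",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# urllib.request.urlopen(req) would actually start the run; omitted here.
```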
&lt;P&gt;&lt;STRONG&gt;Relevant documentation&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;- Scheduled triggers (CRON): &lt;A href="https://docs.databricks.com/aws/en/jobs/scheduled" target="_blank"&gt;https://docs.databricks.com/aws/en/jobs/scheduled&lt;/A&gt;&lt;BR /&gt;- Table update triggers: &lt;A href="https://docs.databricks.com/aws/en/jobs/trigger-table-update" target="_blank"&gt;https://docs.databricks.com/aws/en/jobs/trigger-table-update&lt;/A&gt;&lt;BR /&gt;- File arrival triggers: &lt;A href="https://docs.databricks.com/aws/en/jobs/file-arrival-triggers" target="_blank"&gt;https://docs.databricks.com/aws/en/jobs/file-arrival-triggers&lt;/A&gt;&lt;BR /&gt;- Jobs API - Update endpoint: &lt;A href="https://docs.databricks.com/api/workspace/jobs/update" target="_blank"&gt;https://docs.databricks.com/api/workspace/jobs/update&lt;/A&gt;&lt;BR /&gt;- Jobs API - Run Now endpoint: &lt;A href="https://docs.databricks.com/api/workspace/jobs/runnow" target="_blank"&gt;https://docs.databricks.com/api/workspace/jobs/runnow&lt;/A&gt;&lt;BR /&gt;- Databricks SDK for Python: &lt;A href="https://docs.databricks.com/aws/en/dev-tools/sdk-python" target="_blank"&gt;https://docs.databricks.com/aws/en/dev-tools/sdk-python&lt;/A&gt;&lt;BR /&gt;- Quartz Cron syntax reference: &lt;A href="http://www.quartz-scheduler.org/documentation/quartz-2.3.0/tutorials/crontrigger.html" target="_blank"&gt;http://www.quartz-scheduler.org/documentation/quartz-2.3.0/tutorials/crontrigger.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Thanks for sharing this -- the pattern generalizes nicely beyond sports data to any domain with externally defined, irregular event timestamps (IoT maintenance windows, financial market closes, logistics ETAs, etc.).&lt;/P&gt;
&lt;P&gt;* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.&lt;/P&gt;
      <pubDate>Sun, 08 Mar 2026 05:01:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/orchestrating-irregular-databricks-jobs-from-external-source/m-p/150150#M1060</guid>
      <dc:creator>SteveOstrowski</dc:creator>
      <dc:date>2026-03-08T05:01:19Z</dc:date>
    </item>
  </channel>
</rss>

