I want to build a job, or a set of jobs, that loads data from several different source systems (approximately 600 tasks in total). Data from the source systems arrives at different times each day, and a query needs to be run to check whether the day's data has already arrived. First there are tasks that use only one source system; then there are tasks, depending on these first tasks, that combine data from the different source systems. I see two alternatives here:
1. Create one big job with all the tasks and dependencies set up within the job. Create a waiting task that checks regularly if the given source system data has already arrived. If yes, then the task succeeds and the dependent task can start.
The potential advantage I see is that all the dependencies can be viewed, monitored and maintained within a single master job.
The potential disadvantage is that the job has to run for hours, just waiting for the data and repeatedly running the check query, which can be costly.
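For concreteness, a minimal sketch of what the waiting task in option 1 could look like. The function name, parameters, and the `check_arrived` callable are all hypothetical; in practice `check_arrived` would wrap the source-specific arrival query (e.g. a `spark.sql(...)` count), which is not shown here:

```python
import time

def wait_for_source(check_arrived, poll_interval_s=600, timeout_s=6 * 3600):
    """Poll until check_arrived() returns True, or raise after timeout_s.

    check_arrived: zero-argument callable that runs the arrival query and
    returns True once the day's data for this source system is present.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check_arrived():
            return True  # task succeeds; dependent tasks may start
        # Sleep until the next poll, but never past the deadline.
        time.sleep(min(poll_interval_s, max(0.0, deadline - time.monotonic())))
    raise TimeoutError("source data did not arrive within the timeout")
```

A timeout like this matters for the cost concern: without it, a source system that never delivers would keep the cluster running indefinitely.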
2. Create separate jobs for the tasks belonging to separate source systems, and implement the trigger logic in a separate, small job running a notebook. This triggering job starts every x minutes (or runs continuously on a very small cluster), runs the necessary queries and triggers any jobs that can already start. This triggering job would also take care of the dependencies between jobs.
Here the advantage is that the master job does not need to run idly for hours.
The disadvantage is that dependencies are handled in two different ways: within jobs natively in Databricks, and between jobs in a notebook. Monitoring and maintaining this is more difficult.
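To illustrate the trigger logic in option 2: the dependency bookkeeping can be kept as a small pure function, with the actual triggering done through the Databricks Jobs API (`POST /api/2.1/jobs/run-now`). This is only a sketch under assumed inputs; the job names, source names, and the `trigger_job` helper are hypothetical, and host/token handling is simplified:

```python
import json
import urllib.request

def jobs_ready_to_start(job_sources, arrived_sources, already_started):
    """Return job names whose required source systems have all arrived
    and that have not been triggered yet.

    job_sources: dict mapping job name -> set of required source systems.
    """
    return sorted(
        job
        for job, sources in job_sources.items()
        if job not in already_started and sources <= arrived_sources
    )

def trigger_job(host, token, job_id):
    # Databricks Jobs API 2.1 "run now" call (host, token, job_id are
    # workspace-specific values supplied by the caller).
    req = urllib.request.Request(
        f"{host}/api/2.1/jobs/run-now",
        data=json.dumps({"job_id": job_id}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["run_id"]
```

The triggering notebook would run the arrival queries, update `arrived_sources`, and call `trigger_job` for whatever `jobs_ready_to_start` returns; downstream "combine" jobs simply list all the source systems they depend on.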
Which of these do you think is the better way to implement this? Are there any more pros and cons for these options, or any other alternative you would recommend?