I want to build a job, or a set of jobs, that loads data from several different source systems (approximately 600 tasks in total). Data from the source systems arrives at different times each day, and a query needs to be run to check whether the day's data has already arrived. First there are tasks that use only one source system; then there are tasks, depending on these first tasks, that combine data from the different source systems. I see two alternatives here:
1. Create one big job with all the tasks and dependencies set up within the job. Create a waiting task that checks regularly if the given source system data has already arrived. If yes, then the task succeeds and the dependent task can start.
The potential advantage I see is that all the dependencies can be viewed, monitored and maintained within a single master job.
The potential disadvantage is that the job has to run for hours, just waiting for the data and repeatedly running the check query, which can be costly.
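For concreteness, a minimal sketch of what the waiting task in option 1 could look like. The function name, parameters, and the `check_arrived` callable are all hypothetical; in practice `check_arrived` would wrap the source-specific arrival query (e.g. a `spark.sql(...)` count), which is not shown here:

```python
import time

def wait_for_source(check_arrived, poll_interval_s=600, timeout_s=6 * 3600):
    """Poll until check_arrived() returns True, or raise after timeout_s.

    check_arrived: zero-argument callable that runs the arrival query and
    returns True once the day's data for this source system is present.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check_arrived():
            return True  # task succeeds; dependent tasks may start
        # Sleep until the next poll, but never past the deadline.
        time.sleep(min(poll_interval_s, max(0.0, deadline - time.monotonic())))
    raise TimeoutError("source data did not arrive within the timeout")
```

A timeout like this matters for the cost concern: without it, a source system that never delivers would keep the cluster running indefinitely.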
2. Create separate jobs for the tasks belonging to separate source systems, and implement the trigger logic in a separate, small job running a notebook. This triggering job starts every x minutes (or runs continuously on a very small cluster), runs the necessary queries and triggers any jobs that can already start. This triggering job would also take care of the dependencies between jobs.
Here the advantage is that the master job does not need to run idly for hours.
The disadvantage is that dependencies are handled in two different ways: within jobs natively in Databricks, and between jobs in a notebook. Monitoring and maintaining this is more difficult.
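To illustrate the trigger logic in option 2: the dependency bookkeeping can be kept as a small pure function, with the actual triggering done through the Databricks Jobs API (`POST /api/2.1/jobs/run-now`). This is only a sketch under assumed inputs; the job names, source names, and the `trigger_job` helper are hypothetical, and host/token handling is simplified:

```python
import json
import urllib.request

def jobs_ready_to_start(job_sources, arrived_sources, already_started):
    """Return job names whose required source systems have all arrived
    and that have not been triggered yet.

    job_sources: dict mapping job name -> set of required source systems.
    """
    return sorted(
        job
        for job, sources in job_sources.items()
        if job not in already_started and sources <= arrived_sources
    )

def trigger_job(host, token, job_id):
    # Databricks Jobs API 2.1 "run now" call (host, token, job_id are
    # workspace-specific values supplied by the caller).
    req = urllib.request.Request(
        f"{host}/api/2.1/jobs/run-now",
        data=json.dumps({"job_id": job_id}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["run_id"]
```

The triggering notebook would run the arrival queries, update `arrived_sources`, and call `trigger_job` for whatever `jobs_ready_to_start` returns; downstream "combine" jobs simply list all the source systems they depend on.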
Which of these do you think is the better way to implement this? Are there any more pros and cons for these options, or any other alternative you would recommend?