Hi @smoortema
To the best of my knowledge:
Option 1)
You can create jobs that contain up to 1,000 tasks; however, it is recommended to split tasks into logical subgroups.
Jobs with more than 100 tasks require Jobs API 2.2 or above (see "Jobs with a large number of tasks" in the Databricks documentation).
For max concurrency, I would recommend checking the max concurrent runs limitations.
You should also consider the maximum number of tasks running at the same time in the context of the cluster size you will be using. That also depends on your workload: is it critical, are there SLAs to be met, what is the cost vs. execution time trade-off, etc. This is something to be tested for your scenario.
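As a minimal sketch of such a job using the databricks-sdk Python client (the job name, notebook paths, and task count below are all illustrative, not from your question):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# One task per source table; task keys and notebook paths are placeholders.
load_tasks = [
    jobs.Task(
        task_key=f"load_table_{i}",
        notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/load_table"),
    )
    for i in range(10)
]

# A final task that depends on all loads - the built-in dependency graph.
aggregate = jobs.Task(
    task_key="aggregate",
    depends_on=[jobs.TaskDependency(task_key=t.task_key) for t in load_tasks],
    notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/aggregate"),
)

job = w.jobs.create(
    name="nightly_etl",
    max_concurrent_runs=1,  # caps parallel runs of the whole job, not tasks
    tasks=load_tasks + [aggregate],
)
```

Note that max_concurrent_runs limits how many runs of the job overlap; tasks within a single run parallelize according to their dependencies.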
Pros:
- better observability (GUI, jobs/tasks split into logical subgroups)
- better resource allocation and cost optimisation (you can define separate clusters that fit each task's or job's needs, enable the Photon engine, use a job cluster or a shared cluster)
- built-in fault tolerance (retry logic, rerunning of failed tasks; see the sketch after this list)
- CI/CD with asset bundles
- built-in dependency support
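The retry behaviour is configured with plain task-level fields. A hedged sketch (the task key, notebook path, and all numeric values are illustrative):

```python
from databricks.sdk.service import jobs

# Per-task retry settings provide the built-in fault tolerance.
resilient_task = jobs.Task(
    task_key="ingest",
    notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/ingest"),
    max_retries=3,                     # rerun a failed task up to 3 times
    min_retry_interval_millis=60_000,  # wait one minute between attempts
    retry_on_timeout=True,
    timeout_seconds=3600,
)
```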
Cons:
- changes require a full bundle redeployment (this may possibly be handled by custom CI/CD logic with bundle separation)
- the GUI becomes unwieldy with hundreds of tasks if they are not split into logical subgroups
- there may be some overhead when jobs include complex dependencies and lots of tasks
Recommendations:
- Consider using job clusters vs. shared clusters based on your workload; test and compare the costs of both (see the cluster sketch after this list)
- Set reasonable concurrency limits to avoid resource contention
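To illustrate the job-cluster option (a sketch only; the runtime version, node type, and worker count are placeholders to adapt to your workload and cloud):

```python
from databricks.sdk.service import compute, jobs

# One job cluster reused by several tasks of the same run, instead of a
# separate cluster per task; all spec values below are placeholders.
shared_cluster = jobs.JobCluster(
    job_cluster_key="etl_cluster",
    new_cluster=compute.ClusterSpec(
        spark_version="15.4.x-scala2.12",
        node_type_id="Standard_DS3_v2",
        num_workers=4,
    ),
)

task = jobs.Task(
    task_key="load_table_1",
    job_cluster_key="etl_cluster",  # reference the shared job cluster
    notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/load_table"),
)
# Pass job_clusters=[shared_cluster] and tasks=[task, ...] to w.jobs.create().
```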
Option 2)
Pros:
Custom logic with a "metadata" approach gives you endless flexibility, but it may get complex and hard to maintain over time (a toy sketch follows below).
Cons:
No built-in support for dependencies or fault tolerance (your orchestrator becomes a single point of failure).
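A toy sketch of what that custom logic tends to look like (this assumes it runs inside a Databricks notebook, where dbutils is available; the metadata dict, paths, and retry count are all made up):

```python
# Metadata: task name -> (notebook path, upstream tasks). In practice this
# would come from a config table or file; these entries are illustrative.
tasks = {
    "ingest":    ("/Repos/etl/ingest", []),
    "transform": ("/Repos/etl/transform", ["ingest"]),
    "report":    ("/Repos/etl/report", ["transform"]),
}

done = set()
while len(done) < len(tasks):
    # Find tasks whose dependencies have all finished.
    ready = [name for name, (_, deps) in tasks.items()
             if name not in done and all(d in done for d in deps)]
    if not ready:
        raise RuntimeError("cyclic or unsatisfiable dependencies")
    for name in ready:
        path, _ = tasks[name]
        # Hand-rolled retry loop - exactly what jobs give you for free.
        for attempt in range(3):
            try:
                dbutils.notebook.run(path, timeout_seconds=3600)
                break
            except Exception:
                if attempt == 2:
                    raise
        done.add(name)
```

Everything here - ordering, retries, failure propagation, observability - is now your code to maintain, which is the trade-off mentioned above.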
When to use:
- If your dependencies are highly dynamic and change frequently at runtime
- If you need very specific custom orchestration logic that Databricks jobs can't handle
- If you have severe cost constraints and need maximum resource efficiency
- If your team has strong expertise in building distributed systems and a deep understanding of the Spark execution engine
Summary:
In general, I would highly recommend option 1) for production workloads.
Test (!!!) your approach and adjust the setup to your project's needs.
Best,
Radek.