Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Complex pipeline with many tasks and dependencies: orchestration in Jobs or in Notebook?

smoortema
New Contributor II

We need to set up a job that consists of several hundred tasks with many dependencies among them. We are considering two different directions:

1. A Databricks job with tasks, where the dependencies are defined as code and deployed with Databricks asset bundles. The resulting job would probably be too complicated to visualise in the Jobs GUI. However, it could be broken down into subjobs, with a main job orchestrating between the subjobs. (A sketch of what generated task definitions could look like follows after the two options.)

2. A single job, where a notebook does the orchestration: it evaluates which tasks have all their dependencies met and then runs each of them, either by calling another notebook or by passing parameters to a second, executor task whose job would be to run the notebook. Dependencies are stored in a table that is queried by the orchestrator notebook. This would run in multiple threads in parallel so that more than one task can run at the same time; the parallelism can be created either in the Databricks job or in Python.
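To make option 1 more concrete, here is a rough sketch of how we imagine generating the task definitions for the asset bundle from a dependency mapping instead of hand-writing hundreds of YAML blocks. The task keys, notebook paths and the dependency dict are made up for illustration only:

import yaml  # PyYAML, only used here to print the fragment in bundle format

# task_key -> list of upstream task_keys (made-up example)
dependencies = {
    "load_customers": [],
    "load_orders": [],
    "join_orders_customers": ["load_customers", "load_orders"],
    "aggregate_revenue": ["join_orders_customers"],
}

tasks = []
for task_key, upstream in dependencies.items():
    task = {
        "task_key": task_key,
        "notebook_task": {"notebook_path": f"../notebooks/{task_key}"},
    }
    if upstream:
        task["depends_on"] = [{"task_key": u} for u in upstream]
    tasks.append(task)

# Print a tasks fragment that could be generated into resources/<job>.yml of the bundle
print(yaml.safe_dump({"tasks": tasks}, sort_keys=False))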

With option 1, we have the following questions:

  • Is there a maximum number of dependencies a task can have, and can jobs handle that many without performance overhead?
  • If a task's dependencies are not yet met, does the waiting task reserve capacity that could otherwise be used elsewhere? Or does the orchestration reallocate resources intelligently and, instead of waiting, start another task whose dependencies are already met? (Compare this with option 2, where the task selected by the orchestrator notebook is always one that can run immediately, so there is no waiting.)
  • More generally, how many threads is a single job run on? For example, if there are 100 tasks without dependencies in a job, how many of them are started at the same time, and how does this depend on the cluster it runs on and the number of workers?

With option 2, we have the following questions:

  • If the parallelism is defined in a for each task, is it possible to implement do...until logic? We do not know upfront how many times the orchestrator needs to run. If the answer is no, then the loop also needs to be in Python (see the sketch after this list).
  • In option 2, there is only one job running with one or two tasks. Some (or most) of the Databricks logging and monitoring functionality that comes with jobs is lost and has to be recreated programmatically. How big a price is this to pay?
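For reference, a rough Python sketch of the orchestrator loop we have in mind for option 2. The table name, notebook paths and worker count are placeholders, and it assumes the orchestrator runs as a Databricks notebook where spark and dbutils are available:

from concurrent.futures import ThreadPoolExecutor

# task_key -> set of upstream task_keys; in practice read from the dependency table
deps = {
    row["task_key"]: set(row["depends_on"] or [])
    for row in spark.table("orchestration.task_dependencies").collect()
}

done = set()

def run_task(task_key):
    # each worker thread runs one notebook; 0 means no timeout
    dbutils.notebook.run(f"/Repos/pipeline/{task_key}", 0, {"task_key": task_key})
    return task_key

# "do ... until" in plain Python: loop until every task has finished,
# starting in each round only the tasks whose dependencies are already met
with ThreadPoolExecutor(max_workers=8) as pool:
    while len(done) < len(deps):
        ready = [t for t, up in deps.items() if t not in done and up <= done]
        if not ready:
            raise RuntimeError("no runnable task left - circular or missing dependency")
        for finished in pool.map(run_task, ready):
            done.add(finished)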

In addition to the above questions, any general advice on the topic is also welcome!

1 ACCEPTED SOLUTION

radothede
Valued Contributor II

Hi @smoortema 

To the best of my knowledge:

Option 1)

You can create jobs that contain up to 1000 tasks; however, it is recommended to split tasks into logical subgroups.

Jobs with more than 100 tasks require Jobs API 2.2 and above - see "Jobs with a large number of tasks" in the Databricks documentation.

For max concurrency limitations, I would recommend checking the max concurrent runs limits.

You should also consider the maximum number of tasks running at the same time in the context of the cluster size you will be using. That also depends on your workload - is it critical, are there SLAs to be met, what is the cost vs execution time trade-off, etc. - this is something to be tested for your scenario.

Pros:

- better observability (GUI, jobs/tasks split into logical subgroups)

- better resource allocation and cost optimisation (you can define separate clusters that fit each task's or job's needs, enable the Photon engine, use a job cluster or a shared cluster)

- built-in fault tolerance (retry logic, rerunning failed tasks)

- asset bundles CI/CD

- built-in dependencies support.

Cons:

- changes require full bundle redeployment (can possibly be handled by custom CI/CD logic with bundle separation)

- GUI becomes unwieldy with hundreds of tasks if they are not split into logical subgroups

- there might be "some" overhead when jobs include complex dependencies and lots of tasks

Recommendations:

  • Consider using job clusters vs shared clusters based on your workload - test and compare the costs of both
  • Set reasonable concurrency limits to avoid resource contention
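As an illustration only, a job definition along these lines would combine a dedicated job cluster, per-task retries and a job-level concurrency cap. It is shown here as a Python dict that mirrors the Jobs API / asset bundle job schema; every name and value is a placeholder to be tuned for your workload:

# Illustrative sketch - all values are placeholders, not recommendations
job_settings = {
    "name": "main_pipeline",
    "max_concurrent_runs": 1,  # avoid overlapping runs of the whole job
    "job_clusters": [
        {
            "job_cluster_key": "etl_cluster",
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",  # pick for your cloud / workload
                "num_workers": 4,
                "runtime_engine": "PHOTON",
            },
        }
    ],
    "tasks": [
        {
            "task_key": "load_orders",
            "job_cluster_key": "etl_cluster",
            "notebook_task": {"notebook_path": "../notebooks/load_orders"},
            "max_retries": 2,  # built-in retry logic per task
            "min_retry_interval_millis": 60_000,
        }
    ],
}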

 

Option 2)

Pros:

A custom, metadata-driven approach gives you endless flexibility, but may get complex and hard to maintain over time.

Cons:

No built-in support for dependencies and fault tolerance (the orchestrator is a single point of failure).
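For example, the retry logic that jobs give you out of the box would have to be rebuilt around dbutils.notebook.run - a minimal sketch, with the path, retry count and back-off as placeholders:

import time

def run_with_retries(path, params, max_retries=2, backoff_seconds=60):
    # hypothetical helper: re-implements per-task retries around dbutils.notebook.run
    for attempt in range(max_retries + 1):
        try:
            return dbutils.notebook.run(path, 0, params)
        except Exception:
            if attempt == max_retries:
                raise  # give up and surface the failure to the orchestrator
            time.sleep(backoff_seconds * (attempt + 1))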

When to use:

  • If your dependencies are highly dynamic and change frequently at runtime
  • If you need very specific custom orchestration logic that Databricks jobs can't handle
  • If you have severe cost constraints and need maximum resource efficiency
  • If your team has strong expertise in building distributed systems and a deep understanding of the Spark execution engine

Summary:

In general, I would highly recommend option 1) for production workloads.

Test (!!!) your approach and adjust the setup to your project's needs.

Best,

Radek.

