
Dynamic Number of Tasks in Databricks Workflow

FlexException
New Contributor II

Do Databricks workflows support creating a workflow with a dynamic number of tasks?

For example, let's say we have a DAG like this:

T1 ->   T2(1)   ->
        T2(2)   ->
        .....       -> T3
        T2(n-1) ->
        T2(n)   ->

In this case, task 1 (T1) executes first and creates n tasks (T2(1) through T2(n)) that can execute in parallel. Once all of those T2 tasks finish, a T3 task can run.

You don't know up front how many T2 tasks will exist because they depend on the output of T1 (which changes from run to run). In my specific case, T1 will generate n rows, with each row providing the parameters needed to query a different server in T2.

5 REPLIES

Kaniz
Community Manager

Hi @FlexException, Databricks Workflows provide a powerful way to orchestrate data processing, machine learning, and analytics pipelines on the Databricks Data Intelligence Platform.

 

Let's explore how you can achieve your dynamic task creation scenario:

 

Task 1 (T1): This initial task generates n rows, each containing parameters needed for querying different servers in T2.

Task 2 (T2): You want to create n parallel tasks based on the output of T1. These tasks will be executed concurrently. However, the exact number of T2 tasks is not known upfront.

Task 3 (T3): Once all T2 tasks finish, the final T3 task can run.

 

To achieve this dynamic task creation, consider the following steps:

 

Databricks Jobs: Use Databricks Jobs to run non-interactive code in your Databricks workspace. You can create a job that orchestrates the entire workflow and, within it, programmatically create the T2 tasks based on the output of T1. Databricks provides a user-friendly UI and API for building and monitoring jobs.

 

Task Orchestration: Databricks allows you to efficiently orchestrate tasks in a Directed Acyclic Graph (DAG). You can define dependencies between tasks, ensuring that T2 tasks wait for the completion of T1 before execution. The dynamic creation of T2 tasks can be handled programmatically within your job.

 

Parameterization: Since T1 generates n rows with different parameters, you can pass these parameters as inputs to the dynamically created T2 tasks. Databricks supports parameterization, allowing you to customize task behaviour based on input values.

 

Remember that Databricks simplifies the creation and management of workflows, making it an excellent choice for scenarios like yours.

 

By leveraging Databricks Jobs and task orchestration, you can achieve the desired dynamic task creation while maintaining flexibility and scalability.
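Here is a minimal sketch of what that could look like, assuming T1 has already written one row per server to a (hypothetical) t1_output table and that you call the Jobs API 2.1 (jobs/create) with a personal access token. Notebook paths, table names, and cluster IDs are placeholders, not a definitive implementation:

```python
# Sketch: build a fan-out job from T1's output and submit it via the Jobs API 2.1.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Pretend these came from T1's output, e.g. spark.table("t1_output").collect()
server_params = [{"server": "srv-01"}, {"server": "srv-02"}, {"server": "srv-03"}]

# One T2 task per row, each with its own parameters.
t2_tasks = []
for i, params in enumerate(server_params, start=1):
    t2_tasks.append({
        "task_key": f"T2_{i}",
        "notebook_task": {
            "notebook_path": "/Workflows/query_server",  # hypothetical notebook
            "base_parameters": params,                   # per-server parameter
        },
        "existing_cluster_id": "<cluster-id>",
    })

# T3 depends on every generated T2 task, so it only starts after the fan-out finishes.
t3_task = {
    "task_key": "T3",
    "depends_on": [{"task_key": t["task_key"]} for t in t2_tasks],
    "notebook_task": {"notebook_path": "/Workflows/aggregate_results"},
    "existing_cluster_id": "<cluster-id>",
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"name": "dynamic-fan-out", "tasks": t2_tasks + [t3_task]},
)
resp.raise_for_status()
print("Created job_id:", resp.json()["job_id"])
```

Because each T2 task gets its own base_parameters and T3's depends_on lists every generated task_key, the job still shows up as a single DAG in the Workflows UI, so you keep per-task visibility.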

 

Feel free to explore the Databricks documentation for detailed examples and best practices.

 

Happy orchestrating! 🚀

FlexException
New Contributor II

@Kaniz, really appreciate your reply.

I've looked through those blog posts and don't see any examples of dynamic task creation that retain the visibility you get through Workflows across all of the tasks.

Could you point me to an example?

All of the links you posted lead to the same blog post from mid-2021, so maybe I'm missing a key piece here.

 

Thanks!

FlexException
New Contributor II

@Kaniz, would you happen to have more documentation on the process you outlined to achieve the dynamic task creation?

Hey, @FlexException

Given that you can create and update workflows dynamically via their JSON definitions, it may turn out to be a matter of playing around with the JSON. See the blog below and let me know if it helps. Be cautious about how you manage all of these dynamic changes, though:

https://medium.com/@rishianand.nits/create-update-databricks-workflow-dynamically-abba4b0916b8
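As a rough sketch of that create-and-update pattern (assuming an existing workflow with task keys T1, T2_*, and T3, and using the public Jobs API 2.1 jobs/get and jobs/reset endpoints; notebook paths and cluster IDs are placeholders):

```python
# Sketch: pull the job's JSON, rebuild the T2 fan-out, and write the definition back.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
JOB_ID = 123  # the workflow created in the UI or via jobs/create

# 1. Fetch the current job definition.
settings = requests.get(
    f"{HOST}/api/2.1/jobs/get", headers=HEADERS, params={"job_id": JOB_ID}
).json()["settings"]

# 2. Keep T1 and T3, regenerate the T2_* tasks from T1's latest output.
kept = [t for t in settings["tasks"] if not t["task_key"].startswith("T2_")]
t1 = next(t for t in kept if t["task_key"] == "T1")
t3 = next(t for t in kept if t["task_key"] == "T3")

new_t2 = []
for i, server in enumerate(["srv-01", "srv-02"], start=1):  # from T1's latest output
    new_t2.append({
        "task_key": f"T2_{i}",
        "depends_on": [{"task_key": "T1"}],
        "notebook_task": {
            "notebook_path": "/Workflows/query_server",
            "base_parameters": {"server": server},
        },
        "existing_cluster_id": "<cluster-id>",
    })
t3["depends_on"] = [{"task_key": t["task_key"]} for t in new_t2]
settings["tasks"] = [t1] + new_t2 + [t3]

# 3. Overwrite the job definition in place; the Workflows UI shows the new DAG.
requests.post(
    f"{HOST}/api/2.1/jobs/reset", headers=HEADERS,
    json={"job_id": JOB_ID, "new_settings": settings},
).raise_for_status()
```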

tanyeesern
New Contributor II

@FlexException The Databricks API supports job creation and execution: Task Parameters and Values in Databricks Workflows | by Ryan Chynoweth | Medium
One possibility is, after running the earlier job, to process its output and create a dynamic number of tasks in the subsequent job, as sketched below.
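As a rough sketch of that hand-off (assuming T1 is a notebook task inside a Databricks job, so spark and dbutils are available, and the table/notebook names are placeholders), T1 can publish its output as task values that a downstream task reads before building the next job:

```python
# --- inside the T1 notebook task ---
# Publish the per-server parameters produced by T1 as a task value.
servers = [r["server"] for r in spark.table("t1_output").collect()]  # hypothetical table
dbutils.jobs.taskValues.set(key="servers", value=servers)

# --- inside a downstream notebook task that depends on T1 ---
# Read the list back and build one T2 task per entry, then submit the follow-up
# job, e.g. via /api/2.1/jobs/create as in the earlier sketches.
servers = dbutils.jobs.taskValues.get(taskKey="T1", key="servers",
                                      default=[], debugValue=["srv-01"])
```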
