Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Need help with setting up ForEach task in Databricks

Yuppp
New Contributor

Hi everyone,

I have a workflow involving two notebooks: Notebook A and Notebook B. At the end of Notebook A, we generate a variable number of files, let's call it N. I want to run Notebook B for each of these N files.

I know Databricks has a Foreach task that can iterate over a list of items.

Here's what I've tried so far:

output_dir_paths = [<list of paths>]

dbutils.jobs.taskValues.set(key="notebook_A_output_paths", value=output_dir_paths)

The ForEach loop: [screenshot: For Each.jpg]

The task: [screenshot: Task.jpg]

In Notebook B, I'm attempting to read each path like this:

path = dbutils.widgets.get("single_batch_file")


Could someone please help me correct the code to pass the list of paths from Notebook A, iterate over each path, and send it to Notebook B?

1 REPLY

mark_ott
Databricks Employee

You can use the Databricks Workflows foreach task to run Notebook B once for each file generated in Notebook A. The key is to pass each path to Notebook B as a job parameter backed by task values, rather than setting widget values manually. Here's how to structure the workflow step by step:

1. Notebook A: Produce and Pass Output

After you create your output paths (a Python list of strings), set them as a task value:

python
output_dir_paths = [...]  # list of paths generated in Notebook A
dbutils.jobs.taskValues.set(key="notebook_A_output_paths", value=output_dir_paths)

This persists the list as a task value on the job run so downstream tasks can reference it; no widgets are involved at this stage.
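
For reference, a fuller ending for Notebook A might look like the sketch below; the output directory /mnt/output/batches is a made-up example, so substitute your own location:

python
# Hypothetical example: gather the N files Notebook A just wrote.
# "/mnt/output/batches" is a placeholder path, not from the original post.
output_dir = "/mnt/output/batches"
output_dir_paths = [f.path for f in dbutils.fs.ls(output_dir)]

# Publish the list as a task value; the value must be JSON-serializable.
dbutils.jobs.taskValues.set(key="notebook_A_output_paths", value=output_dir_paths)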


2. Workflow Configuration: Foreach (in Databricks Job UI)

  • Task A: Notebook A runs as the first step.

  • Task B: Downstream task set up as a “foreach” loop.

    • In the "items" field, reference the output from Notebook A:

      text
      {{tasks.taskA.values.notebook_A_output_paths}}
    • Each iteration takes one item (path) from this list; inside the nested task, the current value is available as {{input}}.

In the nested Notebook B task, define a parameter named single_batch_file and set its value to {{input}}; a job-as-code sketch of the whole configuration follows below.
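
If you define the job as code rather than through the UI, the equivalent configuration looks roughly like this; the task keys, notebook paths, and concurrency value are illustrative, not from the original post:

json
{
  "tasks": [
    {
      "task_key": "taskA",
      "notebook_task": {"notebook_path": "/Notebooks/NotebookA"}
    },
    {
      "task_key": "foreach_notebook_B",
      "depends_on": [{"task_key": "taskA"}],
      "for_each_task": {
        "inputs": "{{tasks.taskA.values.notebook_A_output_paths}}",
        "concurrency": 5,
        "task": {
          "task_key": "run_notebook_B",
          "notebook_task": {
            "notebook_path": "/Notebooks/NotebookB",
            "base_parameters": {"single_batch_file": "{{input}}"}
          }
        }
      }
    }
  ]
}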


3. Notebook B: Receive and Use the Parameter

In Notebook B, you should register a widget with the same name as the parameter in the workflow:

python
dbutils.widgets.text("single_batch_file", "")
path = dbutils.widgets.get("single_batch_file")
print("Processing", path)

This retrieves the path for each parallel run from the foreach loop.
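
From there, Notebook B can consume the file however it needs. A minimal sketch, assuming the batch files are Parquet (the file format isn't stated in the post):

python
# Assumption: batch files are Parquet; swap the format for csv/json/etc. as needed.
df = spark.read.format("parquet").load(path)
print(f"Processing {path}: {df.count()} rows")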


4. How the Data Flows

  • Notebook A publishes the list via task values.

  • Databricks Job picks up the list and the foreach splits it into N parallel runs, each with a different path.

  • Notebook B receives single_batch_file as a widget (from the job parameter) and processes accordingly.


Key Points

  • Don’t set widget values manually in Notebook A for use in Notebook B; pass data with job parameters and task values instead.

  • Always declare the widget in B using dbutils.widgets.text(...) so jobs can inject the parameter.

  • The parameter name in the workflow (single_batch_file) must exactly match the widget name in Notebook B.
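
One extra tip for interactive development: dbutils.jobs.taskValues.get accepts a debugValue, so a downstream notebook that reads the whole list directly still runs outside a job context. The taskKey and placeholder path below are illustrative:

python
# Reads the full list from Notebook A's task value when run inside the job;
# outside a job run, get() returns debugValue instead of raising.
paths = dbutils.jobs.taskValues.get(
    taskKey="taskA",
    key="notebook_A_output_paths",
    default=[],
    debugValue=["/tmp/example_batch_file"],
)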




Summary Table

Step           Action
Notebook A     Build the list of output paths; publish it with dbutils.jobs.taskValues.set
Workflow (UI)  Add a foreach task; Inputs = {{tasks.taskA.values.notebook_A_output_paths}}
Notebook B     Register the single_batch_file widget and read it

This setup is scalable, robust, and leverages built-in Databricks Workflow best practices for passing dynamic file lists between notebook tasks.
