Need help with setting up ForEach task in Databricks

Yuppp
New Contributor

Hi everyone,

I have a workflow involving two notebooks: Notebook A and Notebook B. At the end of Notebook A, we generate a variable number of files, let's call it N. I want to run Notebook B for each of these N files.

I know Databricks has a Foreach task that can iterate over a list of items.

Here's what I've tried so far

output_dir_paths = [<list of paths>]

dbutils.jobs.taskvalues.set(key="notebook_A_output_paths", value=output_dir_paths)

ForEach Loop:

[Screenshot: For Each.jpg]

The Task:

[Screenshot: Task.jpg]

In Notebook B, I'm attempting to read each path like this:

path = dbutils.widgets.get("single_batch_file")


Could someone please help me correct the code to pass the list of paths from Notebook A, iterate over each path, and send it to Notebook B?

mark_ott
Databricks Employee

You can use Databricks Workflows' foreach task to handle running Notebook B for each file generated in Notebook A. The key is to pass each path as a parameter to Notebook B using Databricks task values and workflows features, not widgets set manually. Here’s how you can structure this workflow step by step:

1. Notebook A: Produce and Pass Output

After you create your output paths (a Python list of strings), set them as a task value:

python
output_dir_paths = [...]  # List of paths generated in Notebook A
dbutils.jobs.taskValues.set(key="notebook_A_output_paths", value=output_dir_paths)

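This stores the list as a task value that downstream tasks in the same job run can reference; it is not a widget. The value must be JSON-serializable, so a list of path strings works.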
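As a rough sketch of the Notebook A side, the snippet below checks that the list is JSON-serializable before setting it. The helper name to_task_value and the example paths are made up for illustration, and the try/except is only there so the snippet also runs outside Databricks, where dbutils does not exist.

```python
import json


def to_task_value(paths):
    """Return a list of plain path strings that is safe to store as a task value."""
    cleaned = [str(p) for p in paths]
    json.dumps(cleaned)  # raises TypeError if something is not JSON-serializable
    return cleaned


# Hypothetical paths; in practice this is whatever Notebook A produced.
output_dir_paths = to_task_value(["/mnt/out/batch_0", "/mnt/out/batch_1"])

try:
    # Only available inside a Databricks notebook.
    dbutils.jobs.taskValues.set(key="notebook_A_output_paths", value=output_dir_paths)
except NameError:
    print("dbutils not available; would have set:", output_dir_paths)
```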


2. Workflow Configuration: Foreach (in Databricks Job UI)

  • Task A: Notebook A runs as the first step.

  • Task B: Downstream task set up as a “foreach” loop.

    • In the "Inputs" field, reference the output from Notebook A, where taskA is the task key of the Notebook A task:

      text
      {{tasks.taskA.values.notebook_A_output_paths}}
    • Each iteration picks one item (a path) from this list and exposes it to the nested task as {{input}}.

In the nested task that runs Notebook B, add a parameter named single_batch_file with the value {{input}}, so that each iteration receives its own path.
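If you define the job through the Jobs API or a bundle instead of the UI, the same configuration might look roughly like the sketch below. It is written as a Python dict that mirrors the JSON payload. The task keys, job name, notebook paths, and concurrency value are placeholders I made up, and the field names reflect my understanding of the Jobs API, so check them against the current API reference.

```python
import json

# Hypothetical task keys and notebook paths; replace them with your own.
job_settings = {
    "name": "process_batches",
    "tasks": [
        {
            "task_key": "taskA",
            "notebook_task": {"notebook_path": "/Workspace/Shared/notebook_A"},
        },
        {
            "task_key": "process_each_path",
            "depends_on": [{"task_key": "taskA"}],
            "for_each_task": {
                # The list that Notebook A stored as a task value.
                "inputs": "{{tasks.taskA.values.notebook_A_output_paths}}",
                # Maximum number of iterations running at the same time.
                "concurrency": 4,
                "task": {
                    "task_key": "process_each_path_iteration",
                    "notebook_task": {
                        "notebook_path": "/Workspace/Shared/notebook_B",
                        # {{input}} resolves to the current item of the list.
                        "base_parameters": {"single_batch_file": "{{input}}"},
                    },
                },
            },
        },
    ],
}

print(json.dumps(job_settings, indent=2))
```

The part that is easy to miss is base_parameters: without single_batch_file set to {{input}}, Notebook B has nothing to read.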


3. Notebook B: Receive and Use the Parameter

In Notebook B, you should register a widget with the same name as the parameter in the workflow:

python
dbutils.widgets.text("single_batch_file", "")
path = dbutils.widgets.get("single_batch_file")
print("Processing", path)

This retrieves the path passed to the current iteration of the foreach loop.
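Since the widget default is an empty string, it can help to fail fast when the parameter did not arrive. Here is a minimal sketch; the helper read_batch_path and the fake_widgets dict are made up for illustration so that the snippet runs outside Databricks.

```python
def read_batch_path(get_widget):
    """Return the path for this iteration, failing fast if it is missing."""
    path = get_widget("single_batch_file").strip()
    if not path:
        raise ValueError(
            "single_batch_file is empty; check the parameter on the nested task"
        )
    return path


# On Databricks you would call: path = read_batch_path(dbutils.widgets.get)
# Local stand-in for dbutils.widgets.get so the sketch runs anywhere.
fake_widgets = {"single_batch_file": "/mnt/out/batch_0"}
path = read_batch_path(fake_widgets.__getitem__)
print("Processing", path)
```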


4. How the Data Flows

  • Notebook A emits the list via TaskValues.

  • The Databricks job reads the list, and the foreach task runs one iteration per path (N in total), with up to the configured concurrency running in parallel.

  • Notebook B receives single_batch_file as a widget (from the job parameter) and processes accordingly.
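To make the flow concrete, here is a purely local simulation in plain Python. It does not call any Databricks API: notebook_b, run_for_each, and the task_values dict are stand-ins I made up to mimic what the job does with the list.

```python
from concurrent.futures import ThreadPoolExecutor


def notebook_b(params):
    """Stand-in for Notebook B: reads its single parameter and processes it."""
    path = params["single_batch_file"]
    return f"processed {path}"


def run_for_each(items, concurrency=1):
    """Stand-in for the foreach task: one call per item, bounded parallelism."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map returns results in the same order as the input items.
        return list(
            pool.map(lambda item: notebook_b({"single_batch_file": item}), items)
        )


# Stand-in for the task value that Notebook A set.
task_values = {
    "notebook_A_output_paths": ["/mnt/out/batch_0", "/mnt/out/batch_1", "/mnt/out/batch_2"]
}
results = run_for_each(task_values["notebook_A_output_paths"], concurrency=2)
print(results)
```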


Key Points

  • Don’t try to manually set widget values in A for use in B; use job parameters and TaskValues instead.

  • Declare the widget in B using dbutils.widgets.text(...) so the notebook has a default value when run interactively; the job overrides it with the parameter value.

  • The naming (single_batch_file) in the workflow must match the widget name in the notebook.



Summary Table

Step | Action
Notebook A | Build the list of output paths; set it with dbutils.jobs.taskValues.set
Workflow (UI) | Configure the foreach task; inputs = {{tasks.taskA.values.notebook_A_output_paths}}; nested task parameter single_batch_file = {{input}}
Notebook B | Register widget single_batch_file and read it

This setup relies only on the built-in task values and foreach features of Databricks Workflows, so it keeps working as the number of files N changes from run to run.