Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks orchestration job

maikel
Contributor II

Hello Community,

We are currently building a system in Databricks where multiple tasks are combined into a single job that produces final output data.

So far, our approach is based on Python notebooks (with asset bundles) that orchestrate the workflow. Each notebook calls functions from separate Python modules responsible for smaller processing steps. We can unit test the Python modules without issues, but testing the notebook logic itself is challenging. At the moment, the only way to validate the full flow is to run everything directly in Databricks.

Because of this limitation, we are considering replacing notebooks with pure Python files. Before making this change, I have a few questions:

  1. How can variables be passed between tasks when using pure Python files?
    I’m familiar with passing variables between notebook tasks, but I’m unsure how this would work with Python scripts.

  2. What is the recommended approach for writing end-to-end (E2E) integration tests for a Databricks job consisting of multiple tasks?

  3. What is the general recommendation — notebooks or pure Python files?
    Regardless of the option, what are the main benefits and trade-offs of each approach?

I would appreciate any insights or best practices based on your experience.

Thank you!

 

1 ACCEPTED SOLUTION


aleksandra_ch
Databricks Employee

Hi @maikel ,

  1. To pass dynamic parameters between Python script tasks:
    1. In the upstream task (named "task_1"), set the task value via dbutils:
      from databricks.sdk.runtime import *
      dbutils.jobs.taskValues.set(key="fave_food", value="beans")
    2. In the downstream task, set the input parameter to reference the upstream task value, e.g. {{tasks.task_1.values.fave_food}} (the original post included a screenshot of this configuration step in the Jobs UI).
    3. In the downstream Python task itself, the dynamic parameter arrives as a command-line argument:
      import argparse
      
      p = argparse.ArgumentParser()
      p.add_argument("-input_dynamic_param")
      args = p.parse_args()
      print(args.input_dynamic_param)
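The same parser can be exercised locally by passing an explicit argv list, which is handy for unit-testing the script's entry point without launching a job. A minimal sketch (the flag name simply mirrors the example above):

```python
import argparse

def parse_args(argv=None):
    # Same flag as in the job configuration; argparse accepts
    # single-dash long options such as -input_dynamic_param.
    p = argparse.ArgumentParser()
    p.add_argument("-input_dynamic_param")
    return p.parse_args(argv)

# Simulate the argv a job run would pass to the script.
args = parse_args(["-input_dynamic_param", "beans"])
print(args.input_dynamic_param)  # beans
```

With argv=None the parser falls back to sys.argv, so the same function serves both the deployed task and local tests.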


  2. A typical integration test of the workflow would be:
    1. Deploy the workflow via Databricks Asset Bundles (to a separate integration/staging workspace, or to a separate target in the DAB definition).
    2. Run the workflow on a subset of data.
    3. Output the result into a separate catalog / schema.
    4. Optionally, add an additional step to the workflow to compare results with the ground truth. 
    5. Ensure that the workflow deployment, input and output data are isolated from other workloads.
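For step 4, the ground-truth check can start as a simple row-level diff. A minimal stand-alone sketch in plain Python (hypothetical column names; in a real workflow you would typically compare DataFrames, e.g. via a Spark join):

```python
def compare_with_ground_truth(actual_rows, expected_rows, key="id"):
    """Return (actual, expected) pairs that differ, keyed on a unique column."""
    expected_by_key = {row[key]: row for row in expected_rows}
    mismatches = []
    for row in actual_rows:
        expected = expected_by_key.get(row[key])
        if expected != row:
            mismatches.append((row, expected))
    return mismatches

actual = [{"id": 1, "total": 10}, {"id": 2, "total": 99}]
expected = [{"id": 1, "total": 10}, {"id": 2, "total": 20}]
print(compare_with_ground_truth(actual, expected))
# [({'id': 2, 'total': 99}, {'id': 2, 'total': 20})]
```

An empty result means the job output matched the ground truth, so the final workflow task can fail the run when any mismatches are returned.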
  3. There is no general recommendation on whether to choose Python scripts or notebooks - it depends on your team's habits and overall practices:
    1. Notebooks give a richer experience (Markdown, widgets, magic commands).
    2. You can also export notebooks as plain Python scripts and run them locally (if the code doesn't depend on notebook-only features).
    3. You can also run Databricks notebooks directly from your local IDE with Databricks Connect.
    4. Note that Lakeflow Spark Declarative Pipelines are different: there, Python files are strongly recommended over notebooks.
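To illustrate point 2: a notebook exported in "source" format is a plain .py file in which the notebook header and cell boundaries are just comments, so the same file runs both as a notebook in Databricks and as an ordinary script (toy cell contents below):

```python
# Databricks notebook source
# In "source" format, the line above marks the file as a notebook and
# "# COMMAND ----------" separates cells; both are ordinary comments,
# so this file also runs as a plain Python script.

# COMMAND ----------

def double(x):
    return 2 * x

# COMMAND ----------

print(double(21))  # prints 42
```

Running `python notebook.py` locally executes the cells top to bottom, which makes this format convenient for version control and simple smoke tests.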

Hope this helps.

Best regards,


REPLIES

maikel
Contributor II

Hello @aleksandra_ch,

Thanks a lot for your response - very helpful! One thing I would like to ask: by Lakeflow Spark Declarative Pipelines, do you mean a chain of jobs that performs data engineering operations?

Thank you!

aleksandra_ch
Databricks Employee

Hi @maikel ,

Happy to help! By Lakeflow Spark Declarative Pipelines (SDP) I mean using the SDP framework instead of plain PySpark / SQL. Check here for more details:

  1. https://docs.databricks.com/aws/en/ldp/
  2. Spark Declarative Pipelines “How-To” Series. Part 1: How to Save Results Into A Table 

Best regards,