Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Passing Parameters *between* Workflow run_job steps

ChristianRRL
Honored Contributor

Hi there, I'm trying to reference a task value - let's call it `output_path` (not known until it's programmatically generated by the code) - that is created in a nested task (Child 1) within a run_job (Parent 1), and use it as an input parameter - let's call it `input_path` - for a downstream run_job (Parent 2). I understand that, due to the way variable scoping works, this may not typically be possible, so I'm looking into possible ways to do it.

Some approaches I'm considering currently:

  • Create a "placeholder" task or run_job parameter variable that is updated by the nested task (Child 1)
    • Pro: explicit, clear reference to the variable
    • Con: harder to scale, and seems a bit brittle
  • Use the REST API `/api/2.2/jobs/runs/get-output` to set and get the variable
    • Pro: overall seems easier to scale
    • Con: harder to implement, and requiring the value to be passed through `dbutils.notebook.exit()` seems a bit limiting
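For what it's worth, a minimal sketch of the second approach, assuming the orchestrating code knows the child notebook task's `run_id` and the child ends with `dbutils.notebook.exit(output_path)`. The `host`/`token` values are placeholders for a workspace URL and PAT:

```python
import json
import urllib.request


def extract_notebook_output(payload: dict) -> str:
    """Pull the value passed to dbutils.notebook.exit() out of the
    /api/2.2/jobs/runs/get-output response payload."""
    notebook_output = payload.get("notebook_output", {})
    if notebook_output.get("truncated"):
        # The API truncates large exit values, so keep the payload small
        # (e.g. a single path or a compact JSON string).
        raise ValueError("Notebook output was truncated by the API")
    return notebook_output["result"]


def fetch_notebook_output(host: str, token: str, run_id: int) -> str:
    """Fetch the child task's exit value via the Jobs REST API."""
    url = f"{host}/api/2.2/jobs/runs/get-output?run_id={run_id}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return extract_notebook_output(json.load(resp))
```

The awkward part, as noted above, is discovering the right `run_id` for the nested task from outside the child job, and the fact that only the `dbutils.notebook.exit()` string comes back.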

Please let me know if there are other/better approaches I may not be considering, or else if one of the above options is generally more or less recommended.

 

NOTE: I'm trying to paste an image, but lately the paste functionality has not been working. I've attached a reference image as well in case the pasted image didn't go through.

1 REPLY

ChristianRRL
Honored Contributor

Quick update, my question effectively boils down to:

Do Databricks Workflows have "global" variables that can be set programmatically from anywhere in the workflow (e.g. a nested notebook task inside a parent run_job task) at runtime and be referenced anywhere else in the workflow, regardless of scope?

Consulting with LLMs, I have some partial answers but still would appreciate some feedback from the community!

Updates on my considered approaches:

  • The first option, I now think, wouldn't work as I was hoping, due to variable scoping
  • The second option still seems viable, but the same challenges/trickiness persist
  • Other options I've seen proposed elsewhere:
    • DBFS/Cloud Storage (e.g. file with runtime information saved and referenced elsewhere during job run)
    • External DB/Table (e.g. tasks read/write key-value pairs to a shared Delta table or external database)
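A minimal sketch of the DBFS/cloud-storage handoff option: the child task writes a small JSON file of key-value pairs to an agreed path, and the downstream job reads it back, regardless of task scope. The path and keys are placeholders (on Databricks this would be a `/dbfs/...` or Volumes path):

```python
import json
from pathlib import Path


def write_handoff(handoff_file: str, values: dict) -> None:
    """Child task: persist runtime values (e.g. the generated output_path)."""
    Path(handoff_file).write_text(json.dumps(values))


def read_handoff(handoff_file: str) -> dict:
    """Downstream job: read the runtime values back."""
    return json.loads(Path(handoff_file).read_text())
```

One design note: keying the file path by a run identifier shared between the two jobs (e.g. a `{{job.run_id}}` parameter reference passed down from the outermost job) avoids collisions between concurrent runs.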