Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Passing Parameters *between* Workflow run_job steps

ChristianRRL
Honored Contributor

Hi there, I'm trying to reference a task value - let's call it `output_path` (not known until generated programmatically by the code) - that is created in a nested task (Child 1) within a run_job (Parent 1), and use it as an input parameter - let's call it `input_path` - for a downstream run_job (Parent 2). I understand that, given how variable scoping works, this may not typically be possible, so I'm looking into possible ways to do it.

Some approaches I'm considering currently:

  • Create a "placeholder" task or run_job parameter variable that is updated by the nested task (Child 1)
    • Pro: explicit and clear reference of the variable
    • Con: more challenging to scale + seems a bit brittle
  • Use the REST API `/api/2.2/jobs/runs/get-output` to set & get the variable
    • Pro: overall seems easier to scale
    • Con: more challenging to implement + requiring the value to be passed through `dbutils.notebook.exit()` seems a bit limiting

Please let me know if there are other/better approaches I may not be considering, or else if one of the above options is generally more or less recommended.

 

NOTE: I'm trying to paste an image, but lately the paste functionality has not been working. I've attached a reference image as well in case the paste didn't go through.


ChristianRRL
Honored Contributor

Quick update, my question effectively boils down to:

Do Databricks Workflows have "global" variables that can be set programmatically from anywhere in the workflow (e.g. a nested notebook task inside a parent run_job task) at runtime and then referenced anywhere else in the workflow, regardless of scope?

Consulting with LLMs, I have some partial answers but still would appreciate some feedback from the community!

Updates on my considered approaches:

  • The first option I don't think would work as I was hoping, due to variable scoping
  • The second option still seems viable, but the same challenges/trickiness persist
  • Other options I've seen proposed elsewhere:
    • DBFS/cloud storage (e.g. a file with runtime information saved and then referenced elsewhere during the job run)
    • External DB/table (e.g. tasks read/write key-value pairs to a shared Delta table or external database)
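For the shared-Delta-table option, here is a minimal sketch of what the key-value plumbing could look like. Everything here is hypothetical: the table name, the `run_key`/`name`/`value` columns, and the `upsert_sql` helper are placeholders, not anything Databricks defines.

```python
# Hypothetical shared key-value state table; every name below is a placeholder.
STATE_TABLE = "main.jobs_state.run_parameters"  # assumed Unity Catalog table

def upsert_sql(table: str, run_key: str, name: str, value: str) -> str:
    """Build the MERGE statement a task would run to set one key-value pair."""
    return (
        f"MERGE INTO {table} AS t "
        f"USING (SELECT '{run_key}' AS run_key, '{name}' AS name, "
        f"'{value}' AS value) AS s "
        f"ON t.run_key = s.run_key AND t.name = s.name "
        f"WHEN MATCHED THEN UPDATE SET t.value = s.value "
        f"WHEN NOT MATCHED THEN INSERT (run_key, name, value) "
        f"VALUES (s.run_key, s.name, s.value)"
    )

# In the writing task (e.g. Child 1), roughly:
#   spark.sql(upsert_sql(STATE_TABLE, run_key, "output_path", output_path))
# And in a reading task:
#   spark.sql(f"SELECT value FROM {STATE_TABLE} "
#             f"WHERE run_key = '{run_key}' AND name = 'output_path'"
#            ).first()["value"]
```

One wrinkle with this pattern: both jobs have to agree on `run_key`. An externally supplied identifier passed as a job parameter to both parents (or a trigger-time value) could serve, but that choice is outside what the platform provides.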

Hi @ChristianRRL,

No. As of now, Lakeflow Jobs doesn’t provide global, mutable variables that you can set from any task and read from any other task, regardless of scope. This is a current limitation of the platform.

I think you’ve already explored the supported patterns (job parameters, task values, etc.). I'm assuming you have a reason to keep the computation inside a separate child job. If so, the most robust option is to persist output_path to an external store (for example, a Delta table or a Unity Catalog volume / external location) in the child job. In the parent job, add a notebook task that reads that value and re-exposes it via dbutils.jobs.taskValues.set, and then reference it in downstream tasks using a dynamic value reference like {{tasks.<task_name>.values.output_path}}.
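As a sketch of that glue task: the table and column names below are my own placeholders, and `pick_latest` is a hypothetical helper; the only real APIs referenced are `spark.table` and `dbutils.jobs.taskValues.set`.

```python
# Sketch of the parent-side glue notebook task. The child job has already
# persisted output_path somewhere durable; this task reads it back and
# republishes it as a task value. Table/column names are assumptions.

def pick_latest(rows):
    """From (value, updated_at) pairs, return the most recently written value."""
    return max(rows, key=lambda r: r[1])[0] if rows else None

# Inside Databricks, the surrounding code would look roughly like:
#
#   rows = [(r["value"], r["updated_at"]) for r in
#           spark.table("main.jobs_state.run_parameters")
#                .where("name = 'output_path'")
#                .collect()]
#   output_path = pick_latest(rows)
#   dbutils.jobs.taskValues.set(key="output_path", value=output_path)
#
# Downstream tasks can then use the dynamic value reference:
#   {{tasks.<glue_task_name>.values.output_path}}
```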

Using GET /api/2.1/jobs/runs/get-output doesn’t give you a global variable either. It’s read-only: you can’t set a variable in Lakeflow Jobs with it. It works best in an external orchestrator pattern (external code runs Parent 1, calls get-output, then starts Parent 2 with that value as a job parameter).
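A rough sketch of that external-orchestrator pattern. The host, token, and job/run IDs are placeholders, and the helper functions are my own; only the two REST endpoints (`runs/get-output` and `run-now`) are the real Jobs API.

```python
import json
import urllib.request

HOST = "https://<workspace-url>"      # placeholder
TOKEN = "<personal-access-token>"     # placeholder

def extract_notebook_result(get_output_response: dict) -> str:
    """runs/get-output surfaces whatever the notebook task passed to
    dbutils.notebook.exit() under notebook_output.result."""
    return get_output_response["notebook_output"]["result"]

def build_run_now_payload(job_id: int, input_path: str) -> dict:
    """Start Parent 2 with the child's output as a job parameter."""
    return {"job_id": job_id, "job_parameters": {"input_path": input_path}}

def get_run_output(run_id: int) -> dict:
    # GET /api/2.1/jobs/runs/get-output?run_id=...
    req = urllib.request.Request(
        f"{HOST}/api/2.1/jobs/runs/get-output?run_id={run_id}",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_now(payload: dict) -> dict:
    # POST /api/2.1/jobs/run-now
    req = urllib.request.Request(
        f"{HOST}/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Orchestration flow (not executed here):
#   out = get_run_output(child_run_id)
#   run_now(build_run_now_payload(parent2_job_id, extract_notebook_result(out)))
```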

Avoid using workspace DBFS for this kind of cross-job state. Prefer Unity Catalog managed storage instead.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

I think this makes a lot of sense.

One follow-up question: do Lakeflow Jobs in some way support the ability for a child task to create or update a parent task value? An example in the context I shared earlier:

  • We have a Lakeflow Job with two sequential run_job steps
  • The first run_job (let's call it Parent 1) has a notebook task (let's call it Child 1)
  • Somehow (if possible) I would like the notebook task to use `dbutils.jobs.taskValues.set()` referencing Parent 1 as the task
  • If this can be achieved, then zooming out, Parent 1 has the task value already set, and it can then be referenced in the following run_job (let's call it Parent 2)
  • Once Parent 2 starts running, all Child 2 related tasks would have access to the referenced variable

Let me know if this makes sense, and whether it's even possible.

Hi @ChristianRRL,

No. Lakeflow Jobs doesn’t support a child job/task setting or updating a parent job’s task values.

dbutils.jobs.taskValues.set() always writes a value for the current task in the current job run. There is no way to target a different task or a different job (like the Run Job parent).

Run Job creates a separate job run. Its task values remain scoped to that child job and cannot become the task values of the parent’s Run Job task, nor can they be read by a sibling Run Job (your Parent 2).

To get your pattern working, you still need to either move Child 1 into the same Lakeflow job as Parent 2 and use task values normally, or have Child 1 persist output_path to UC-managed storage, then in the parent job read it and re-expose it via dbutils.jobs.taskValues.set, which Parent 2 and its children can then reference.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***