Tuesday
Hi, I'm wondering if there is an easier way to accomplish this.
I can use Dynamic Value reference to pull the run_id of Parent 1 into Parent 2, however, what I'm looking for is for Child 1's task run_id to be referenced within Parent 2.
Currently I am considering using the Databricks REST API to get the run_id of a notebook task (Child 1) nested inside a run_job task (Parent 1), which I can later reference in another run_job task downstream (Parent 2).
Would there be another/easier way of doing this?
yesterday
Hi, I would refer to the following cross-post for the solution.
As @emma_s points out, it basically boils down to:
1. Pass {{tasks.parent1.run_id}} to a downstream notebook via base_parameters
2. In that notebook, call get-output with that ID → gives you run_job_output.run_id (the real parent1 run)
3. Call get-run on that → find child1 in the tasks list → grab its run_id
Basically, the part I was missing was getting the `run_job_output.run_id` with which to programmatically get the child1 run_id.
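Those three steps can be sketched as a small resolver. `api_get` below is an injected stand-in for an authenticated GET against the Jobs 2.1 API, and the task key, sample payloads, and run IDs are illustrative assumptions, not values from the thread:

```python
def resolve_child1_run_id(api_get, parent1_task_run_id, child_task_key="child1"):
    """Walk from the orchestrator-level run_job task run_id down to the
    run_id of a task inside the launched child job.

    api_get(endpoint, params) -> dict stands in for an authenticated GET
    (e.g. via requests) against the Jobs 2.1 API.
    """
    # Step 2: get-output on the run_job task exposes the launched job's
    # real run_id under run_job_output
    output = api_get("/api/2.1/jobs/runs/get-output",
                     {"run_id": parent1_task_run_id})
    child_job_run_id = output["run_job_output"]["run_id"]

    # Step 3: get-run on that run lists the child job's tasks;
    # pick out the one we care about and return its run_id
    child_run = api_get("/api/2.1/jobs/runs/get",
                        {"run_id": child_job_run_id})
    task = next(t for t in child_run["tasks"]
                if t["task_key"] == child_task_key)
    return task["run_id"]


# Fake responses shaped like the Jobs 2.1 payloads, to show the chain:
fake_responses = {
    ("/api/2.1/jobs/runs/get-output", 900): {"run_job_output": {"run_id": 901}},
    ("/api/2.1/jobs/runs/get", 901): {
        "tasks": [{"task_key": "child1", "run_id": 902}]
    },
}

def fake_api_get(endpoint, params):
    return fake_responses[(endpoint, params["run_id"])]

child1_run_id = resolve_child1_run_id(fake_api_get, 900)  # 902
```

In a real notebook, `api_get` would wrap `requests.get(f"{host}{endpoint}", headers=headers, params=params).json()` with the workspace host and a bearer token, and `parent1_task_run_id` would arrive via the `{{tasks.parent1.run_id}}` base parameter from step 1.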
Wednesday
Hi, good question. The cleanest way to do this is with task values, no REST API needed.
In Child 1's notebook, capture its own run_id and set it as a task value:
import json

# Read this task's context from dbutils and pull out its own run_id
ctx = json.loads(
    dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()
)
child1_run_id = ctx["currentRunId"]["id"]

# Publish the run_id as a task value for downstream tasks to reference
dbutils.jobs.taskValues.set(key="child1_run_id", value=str(child1_run_id))
Then in your orchestrator job, when configuring Parent 2's job parameters, reference it with:
{{tasks.Parent1.values.child1_run_id}}
Task values set inside a child job are propagated back through the run_job task, so the orchestrator can access them via {{tasks.<run_job_task_name>.values.<key>}}.
As you noticed, {{tasks.Parent1.run_id}} gives you the orchestrator's task run_id for the run_job task itself, not the child job's internal task run_id. That's why task values are the right tool here: they let the child task explicitly publish its own metadata for the parent job to consume.
If you can't modify Child 1's notebook, then yes, the REST API approach works. But if you can add a couple of lines to Child 1, the task values approach is simpler and avoids API calls entirely.
Hope that helps!
Wednesday
I'm sorry, but there are a couple of things I need to call out in your response, @anuj_lathi.
Attached images for reference.
Wednesday
Hi @ChristianRRL you're absolutely right, and I apologize for the earlier suggestion. I've verified that task values from child jobs are not propagated back through run_job tasks.
Your instinct about the REST API was correct. Here's the fix:
Orchestrator:
├── Parent1 (run_job)
├── get_child_run_id (notebook task) ← NEW, depends on Parent1
└── Parent2 (run_job, depends on get_child_run_id)
Notebook (`get_child_run_id`):
import requests

# Reuse one notebook context for both the API host and the token
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
host = ctx.apiUrl().get()
token = ctx.apiToken().get()
headers = {"Authorization": f"Bearer {token}"}

# Get orchestrator run → find Parent1 → get child job run_id → find Child1
job_run_id = spark.conf.get("spark.databricks.job.runId")
orch_run = requests.get(
    f"{host}/api/2.1/jobs/runs/get",
    headers=headers, params={"run_id": job_run_id},
).json()
parent1 = next(t for t in orch_run["tasks"] if t["task_key"] == "Parent1")

child_run = requests.get(
    f"{host}/api/2.1/jobs/runs/get-output",
    headers=headers, params={"run_id": parent1["run_id"]},
).json()
child_job_run_id = child_run["metadata"]["run_id"]

child_job = requests.get(
    f"{host}/api/2.1/jobs/runs/get",
    headers=headers, params={"run_id": child_job_run_id},
).json()
child1 = next(t for t in child_job["tasks"] if t["task_key"] == "Child1")
dbutils.jobs.taskValues.set(key="child1_run_id", value=str(child1["run_id"]))
Then in Parent 2, reference: {{tasks.get_child_run_id.values.child1_run_id}}
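For reference, Parent 2's entry in the orchestrator's job spec would then look roughly like this (field names follow the Jobs 2.1 task schema; the `job_id` and the parameter name are placeholders):

```json
{
  "task_key": "Parent2",
  "depends_on": [{"task_key": "get_child_run_id"}],
  "run_job_task": {
    "job_id": 123,
    "job_parameters": {
      "child1_run_id": "{{tasks.get_child_run_id.values.child1_run_id}}"
    }
  }
}
```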
Can you check if this works?
Apologies again for the earlier response.
Regards,
Anuj
Wednesday
Ok, I think this makes much more sense and I see how it could work.
The only thing I would change for our implementation: rather than having the intermediary `get_child_run_id` notebook task, I am trying to get the `child_run_id` inside the Parent 2 run_job itself, rather than having it passed via the intermediary step. We can still accomplish this by including the dynamic value reference `{{tasks.parent1.run_id}}` as a job parameter for Parent 2. That way, once the Parent 2 run_job has the Parent 1 run_id, we can follow similar steps to what you outlined.
Thank you for your assistance! This helps confirm my solution path.
Thursday
Sorry, I have to mark this as not solution again. But I think the issue is becoming clearer. Please see my attached images.
Basically, the issue I'm having is that I can only get the "Child" task run_id for the task at the orchestrator level (e.g. run_fleet_wtg_ge_silver). However, this run_id is different from the actual nested run_id of the launched job and its respective task. Because a run_job task *launches* a separate instance of that job, I am not able to get the nested job > task run_id I need.
Put another way, what I have is shown in the attached images.
Let me know if this makes sense. This is trickier than I was originally thinking.