Generative AI

How to actually get job_id and run_id in a Databricks Python wheel task (Avoid Hallucinations)

Kirankumarbs
Contributor

We needed `job_id` and `run_id` in a custom metrics Delta table so we could join it to `system.lakeflow.job_run_timeline`. We tried four approaches before finding one that works on serverless compute.

What doesn't work

spark.conf.get("spark.databricks.job.id")
Throws `CONFIG_NOT_AVAILABLE` on serverless. The key exists on classic compute, but it is not exposed through the Spark Connect protocol that serverless uses.

os.environ["DATABRICKS_JOB_ID"]
Not a real env var. Databricks sets `DATABRICKS_RUNTIME_VERSION` and cluster lib paths, but nothing with job identity.

dbutils.notebook.entry_point.getDbutils().notebook().getContext()
Works in notebook tasks. In Python wheel tasks it fails with an `AttributeError` (module has no attribute 'notebook').

spark_env_vars with {{job.id}}
Dynamic value references don't resolve in `spark_env_vars`; the value passes through as the literal string `{{job.id}}`.

What works

Job-level parameters with dynamic value references, piped into task named_parameters:

parameters:
  - name: job_id
    default: "{{job.id}}"
  - name: run_id
    default: "{{job.run_id}}"

tasks:
  - python_wheel_task:
      named_parameters:
        job_id: "{{job.parameters.job_id}}"
        run_id: "{{job.parameters.run_id}}"

The values arrive as command-line arguments in `sys.argv`. Parse them with argparse:

import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--job_id", type=int, default=None)
parser.add_argument("--run_id", type=int, default=None)
# parse_known_args ignores any extra arguments the platform may append
args, _ = parser.parse_known_args(sys.argv[1:])
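To sanity-check the parsing locally before wiring up a job, you can wrap it in a small helper and feed it simulated argument lists. Both `--key=value` and `--key value` shapes parse the same way with argparse; `parse_job_identity` is a name invented here for illustration:

```python
import argparse

def parse_job_identity(argv):
    """Parse job_id/run_id from wheel-task style CLI arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--job_id", type=int, default=None)
    parser.add_argument("--run_id", type=int, default=None)
    # parse_known_args tolerates any extra arguments the platform appends
    args, _unknown = parser.parse_known_args(argv)
    return args.job_id, args.run_id

# Simulated argv; real values come from the {{job.id}} / {{job.run_id}} references
print(parse_job_identity(["--job_id=123", "--run_id=456"]))        # (123, 456)
print(parse_job_identity(["--job_id", "123", "--run_id", "456"]))  # (123, 456)
```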

Bonus: dbruntime.databricks_repl_context also works

from dbruntime.databricks_repl_context import get_context
ctx = get_context()
job_id = ctx.jobId
run_id = ctx.idInJob

Undocumented but functional in both script and wheel tasks on serverless. We went with `named_parameters` because it's the documented approach.
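If you want both paths behind one helper — runtime context first, CLI parameters as the fallback — a hedged sketch follows. The attribute names `jobId` and `idInJob` come from the undocumented context object above, so every access is guarded; `resolve_job_identity` is a name invented here:

```python
import argparse
import sys

def resolve_job_identity(argv=None):
    """Return (job_id, run_id), preferring the runtime context, else CLI args."""
    try:
        # Undocumented; only importable inside a Databricks runtime
        from dbruntime.databricks_repl_context import get_context
        ctx = get_context()
        if ctx is not None and getattr(ctx, "jobId", None):
            return int(ctx.jobId), int(ctx.idInJob)
    except Exception:
        pass  # not on Databricks (or context unavailable): fall through to CLI
    parser = argparse.ArgumentParser()
    parser.add_argument("--job_id", type=int, default=None)
    parser.add_argument("--run_id", type=int, default=None)
    args, _ = parser.parse_known_args(argv if argv is not None else sys.argv[1:])
    return args.job_id, args.run_id
```

Outside Databricks the import fails and the helper silently uses the parsed parameters, so the same wheel runs unmodified in local tests.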

How I figured this out

Wrote a 30-line test script that dumps sys.argv, all env vars, spark conf, and dbutils context. Created a Databricks job with job parameters set to {{job.id}} and {{job.run_id}}. Ran it once. The output showed exactly which sources had real values and which were empty.
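A stripped-down version of that probe might look like the sketch below. Every candidate source is wrapped in a try/except so a missing one prints its error instead of failing the run; the conf key and context import are the ones discussed above, and the env-var filter is just a convenience:

```python
import json
import os
import sys

def dump_runtime_sources():
    """Print every candidate source of job identity so one run shows what's real."""
    print("sys.argv:", sys.argv)
    relevant = {k: v for k, v in os.environ.items()
                if "DATABRICKS" in k or "JOB" in k}
    print("env vars:", json.dumps(relevant, indent=2))
    try:
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.getOrCreate()
        print("spark.databricks.job.id:", spark.conf.get("spark.databricks.job.id"))
    except Exception as e:  # CONFIG_NOT_AVAILABLE on serverless, or no Spark locally
        print("spark conf failed:", e)
    try:
        from dbruntime.databricks_repl_context import get_context
        print("repl context:", vars(get_context()))
    except Exception as e:
        print("repl context failed:", e)

if __name__ == "__main__":
    dump_runtime_sources()
```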

Sometimes the fastest path to the answer is the oldest trick: print everything, read the output.

Full blog post with the story behind these findings: link
