Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Access task level parameters along with parameters passed by airflow job

divyab7
New Contributor II

I have an Airflow DAG that calls a Databricks job with a task-level parameter defined as job_run_id (job.run_id) and a task type of python_script. When I try to access the parameters using sys.argv in the spark_python_task, it only prints the JSON that was passed through the Airflow job. I want sys.argv to receive both the parameters passed by the DAG and the ones defined on the Databricks job.

We have a use case where we don't want to use anything related to dbutils. It's a plain Python script, so we want it to stay independent of dbutils.
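
For context, a stripped-down version of the script (illustrative, not the real one) does nothing more than read sys.argv:

import sys

# Entry point of the python_script task. In this setup, sys.argv only ever
# contains the arguments sent in the run payload, i.e. the JSON passed from
# the Airflow DAG; the task-level job_run_id parameter never shows up here.
if __name__ == "__main__":
    print(sys.argv[1:])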


5 REPLIES

Isi
Honored Contributor II

Hey @divyab7 

I ran into the same thing. The short version is: for a spark_python_task, the script only receives the arguments you send in the run payload, and Databricks does not automatically merge "job-level" parameters with the ones you pass at run time.

What worked for me was to build the job dynamically from Airflow: I keep a small YAML (or dict) with the job defaults (cluster type, wheels, and also any default CLI args I want), and then, when the DAG runs, I merge those defaults with the DAG's dynamic values (like data_interval_start / data_interval_end). The result is a single, flat list of CLI parameters that I send in the parameters field of the run request.

This way, inside the Python script I don't rely on dbutils at all: I just parse the CLI args and everything is there (both the job defaults and the DAG-specific values). The key point is that run-time parameters replace the job's parameters unless you merge them yourself before submitting the run. This approach keeps the job configurable (cluster/image/wheels can change via config) and, at the same time, injects all execution info into the script in a simple, dependency-free way.
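
Roughly, the merge step can look like this (simplified sketch, not my exact setup: the paths, cluster config and names are illustrative, and it assumes the Databricks Airflow provider's DatabricksSubmitRunOperator):

import pendulum
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

# Static job defaults; in my case these come from a small YAML file.
JOB_DEFAULTS = {
    "python_file": "dbfs:/scripts/my_task.py",  # illustrative path
    "default_args": ["--env", "prod"],          # default CLI args
    "cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
}

with DAG(
    dag_id="databricks_python_script",          # illustrative name
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,
) as dag:
    # Merge the static defaults with the DAG's dynamic values into one flat
    # list of CLI parameters; this list is what the script sees in sys.argv.
    parameters = JOB_DEFAULTS["default_args"] + [
        "--data-interval-start", "{{ data_interval_start }}",
        "--data-interval-end", "{{ data_interval_end }}",
    ]

    run_script = DatabricksSubmitRunOperator(
        task_id="run_python_script",
        databricks_conn_id="databricks_default",
        json={
            "run_name": "airflow-triggered-run",
            "new_cluster": JOB_DEFAULTS["cluster"],
            "spark_python_task": {
                "python_file": JOB_DEFAULTS["python_file"],
                "parameters": parameters,
            },
        },
    )

Inside the script, a plain argparse parser then picks everything up from that flat list.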

Tell me if you need more details, 🙂

Isi

divyab7
New Contributor II

Thank you for your response. Can you please give me an example of how to implement this? Should it be structured in a certain way, or do you have a fuller code example?

divyab7
New Contributor II

My use case is that we need job.run_id, and we only get it once the job is triggered; the Python script invoked by the Databricks job needs it in order to move forward. I'm still confused: even if we merge the parameters, how is that going to replace the dynamic value reference in Databricks? Can you please provide a small code example?

Isi
Honored Contributor II

Hey @divyab7 

Sorry, now I understand better what you actually need. I got confused at first and thought you only wanted to access the parameters you pass through Airflow.

I think the dynamic identifiers that Databricks generates at runtime (like run IDs) are not injected into sys.argv automatically.

One way I have been thinking of to get them without using dbutils is:

  • Job ID → you can extract it from spark.conf.get("spark.databricks.clusterUsageTags.clusterName"), which has a value like job-<job_id>-run-<task_run_id>.

  • Job run ID → once you have the job_id, you can call the Databricks Jobs API and retrieve the job_run_id (see the sketch below).

 

This approach should work, but I agree it's not very straightforward. Databricks could definitely make it easier to expose these values directly in the runtime context instead of having to parse them or query the API.
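
Something along these lines is what I had in mind (rough, untested sketch; how you make the workspace URL and a token available to the script is up to you, here I just read them from environment variables):

import os
import re

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# On a job cluster this conf has a value like "job-<job_id>-run-<task_run_id>"
cluster_name = spark.conf.get("spark.databricks.clusterUsageTags.clusterName")
match = re.match(r"job-(\d+)-run-(\d+)", cluster_name)
job_id, task_run_id = match.group(1), match.group(2)

# Query the Jobs API for the currently active run of this job.
host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # a token you make available to the cluster

resp = requests.get(
    f"{host}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"job_id": job_id, "active_only": "true"},
    timeout=30,
)
resp.raise_for_status()
runs = resp.json().get("runs", [])
job_run_id = runs[0]["run_id"] if runs else None

print(job_id, task_run_id, job_run_id)

If several runs of the same job can be active at once, you would still need to disambiguate them, for example by matching the task_run_id from the cluster name against the tasks of each run.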

Hope this helps, 😥
Isi

divyab7
New Contributor II

This was really helpful. Thank you for the response 😊
