
Databricks Python wheel tasks: how to access the JobID & RunID?

GGG_P
New Contributor III

I'm using Python (as Python wheel application) on Databricks.

I deploy & run my jobs using dbx.

I defined some Databricks Workflow using Python wheel tasks.

Everything is working fine, but I'm having trouble extracting "databricks_job_id" & "databricks_run_id" for logging/monitoring purposes.

I'm used to defining {{job_id}} & {{run_id}} as parameters in a "Notebook Task" or other task types, and it works fine there.

But with a Python wheel task I'm not able to define these.

With a Python wheel task, parameters are just an array of strings:

["/dbfs/Shared/dbx/projects/myproject/66655665aac24e748d4e7b28c6f4d624/artifacts/myparameter.yml","/dbfs/Shared/dbx/projects/myproject/66655665aac24e748d4e7b28c6f4d624/artifacts/conf"]

Adding "{{job_id}}" & "{{run_id}}" in this array doesn't seems to work ...

Do you have any ideas? I don't want to call the REST API during my workload just to extract these IDs.

I guess I cannot use dbutils / the notebook context to get those IDs, since I don't use any notebooks.
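
For reference, my wheel's entry point just receives whatever is in that parameters array via sys.argv; a simplified sketch (names made up, not my actual code):

import sys
 
def main():
    # dbx passes the task parameters as plain command-line arguments,
    # so they show up in sys.argv in the order they were declared
    conf_paths = sys.argv[1:]
    print(f"Received parameters: {conf_paths}")
 
if __name__ == "__main__":
    main()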

3 REPLIES

Anonymous
Not applicable

@Grégoire PORTIER:

You can retrieve the job ID and run ID from environment variables within your Python wheel application. Here's an example of how you can do this:

import os
 
# Get the Databricks job ID and run ID from the environment variables
# (os.environ.get returns None if a variable is not set on the cluster)
job_id = os.environ.get("DATABRICKS_JOB_ID")
run_id = os.environ.get("DATABRICKS_RUN_ID")
 
# Print the job ID and run ID for logging/monitoring purposes
print(f"Databricks Job ID: {job_id}")
print(f"Databricks Run ID: {run_id}")

You can then add this code to your Python wheel task to extract the job ID and run ID and use them for logging/monitoring purposes.

Note that the environment variables DATABRICKS_JOB_ID and DATABRICKS_RUN_ID are automatically set by Databricks when you run a job, so you don't need to pass them as parameters.
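
If those variables come back empty on your cluster, another commonly suggested approach is to read the IDs from the notebook context tags via dbutils, which is also available to wheel tasks running on a cluster. A sketch (the context JSON is an internal structure, so the tag names may vary across DBR versions):

from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils
import json
 
spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)
 
# The notebook context carries job metadata as tags, even for wheel tasks
ctx = json.loads(
    dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()
)
tags = ctx.get("tags", {})
print(f"Databricks Job ID: {tags.get('jobId')}")
print(f"Databricks Run ID: {tags.get('runId')}")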

GGG_P
New Contributor III

Hey Suteja,

Thank you for your response, but unfortunately it doesn't work with environment variables.

I get a null value for both variables.

Do you have any idea which DBR I should use?

Or any documentation about these environment variables?

Thank you

AndréSalvati
New Contributor III

Below is a complete template project with Databricks Asset Bundles and a Python wheel task. Please follow the instructions for deployment.

https://github.com/andre-salvati/databricks-template

In particular, take a look at the workflow definition here.
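
For illustration, with Asset Bundles you can pass the IDs to the wheel's entry point through dynamic value references in the task parameters, then parse them like ordinary CLI arguments. A sketch, not the template's exact code; it assumes {{job.id}} and {{job.run_id}} are substituted at run time:

import argparse
 
def main():
    # In the bundle's workflow definition, the python_wheel_task would declare:
    #   parameters: ["--job-id", "{{job.id}}", "--run-id", "{{job.run_id}}"]
    parser = argparse.ArgumentParser()
    parser.add_argument("--job-id", required=True)
    parser.add_argument("--run-id", required=True)
    args = parser.parse_args()
    print(f"Databricks Job ID: {args.job_id}")
    print(f"Databricks Run ID: {args.run_id}")
 
if __name__ == "__main__":
    main()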
