
Hydra configuration and job parameters of DABs

jeremy98
Honored Contributor

Hello Community,

I'm trying to create a job pipeline in Databricks that runs a spark_python_task, which executes a Python script configured with Hydra. The script's configuration file defines parameters, such as id.

How can I pass this parameter at the job level in Databricks so that the task picks it up and Hydra overrides it? And how can I use dbutils.secrets.get from this type of spark_python_task to retrieve the keys I need?

1 REPLY

mark_ott
Databricks Employee

You can pass and override Hydra configuration parameters in a Databricks spark_python_task by specifying job-level parameters, which are handed to the script as command-line arguments that Hydra treats as overrides. For accessing secrets with dbutils.secrets.get, make sure your Python script calls this Databricks utility directly. Here's how to achieve both:

Passing Hydra Parameters in Databricks Job

To override Hydra config values like id from a Databricks job:

  • Use the parameters field in the spark_python_task specification.

  • Hydra interprets positional CLI arguments (e.g., id=foobar) as config overrides; environment variables can also be read via OmegaConf's oc.env resolver (a minimal config sketch follows).
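
A hypothetical conf/config.yaml that such an override targets (the default value here is purely illustrative):

yaml
# conf/config.yaml (hypothetical)
id: "0000"  # default, replaced when the job passes id=1234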

Example Databricks job JSON:

json
{ "tasks": [{ "task_key": "my_hydra_task", "spark_python_task": { "python_file": "dbfs:/path/to/main.py", "parameters": ["id=1234"] } }] }

This sends id=1234 to your script; Hydra will override the value as long as your script accepts it from the CLI (main.py should wrap its entry point with @hydra.main, which handles CLI overrides automatically).
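
Since the title mentions Databricks Asset Bundles, here is a sketch of the equivalent task in a bundle definition; the job name and relative path are hypothetical, and it assumes main.py is synced as part of the bundle:

yaml
# databricks.yml (hypothetical resource definition)
resources:
  jobs:
    hydra_job:
      name: hydra_job
      tasks:
        - task_key: my_hydra_task
          spark_python_task:
            python_file: ../src/main.py
            parameters: ["id=1234"]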

Using dbutils.secrets.get in Your Script

Inside your Python script:

python
from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils

# In Databricks notebooks, 'dbutils' is available automatically;
# in a spark_python_task script, build it from the SparkSession:
spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)
secret_value = dbutils.secrets.get(scope="my_scope", key="my_key")

Access your secret as shown, and use it wherever needed.
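
If you would rather reference secrets from the Hydra config itself, one possible pattern (a sketch, not an official Hydra or Databricks API; the resolver name dbsecret is made up here) is to register a custom OmegaConf resolver that wraps dbutils.secrets.get:

python
from omegaconf import OmegaConf
from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils

spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)

# Register before Hydra composes the config; values like
# ${dbsecret:my_scope,my_key} then resolve to the secret on access.
OmegaConf.register_new_resolver(
    "dbsecret",
    lambda scope, key: dbutils.secrets.get(scope=scope, key=key),
)

A config entry such as api_key: ${dbsecret:my_scope,my_key} would then pull the secret whenever cfg.api_key is accessed.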

Example main.py Structure

python
import hydra
from omegaconf import DictConfig
from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils

@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig):
    # Build dbutils from the SparkSession (scripts have no implicit 'spark')
    spark = SparkSession.builder.getOrCreate()
    dbutils = DBUtils(spark)
    secret = dbutils.secrets.get(scope="my_scope", key="my_key")
    print(f"Received id: {cfg.id}")
    print(f"Retrieved secret: {secret}")

if __name__ == "__main__":
    main()
  • The script gets id from the config file by default; a value passed at the job level overrides it.

  • dbutils.secrets.get retrieves secrets managed by Databricks.
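
To sanity-check the override behavior outside Databricks, you can run the script locally with, e.g., python main.py id=5678 and confirm the printed id changes; the dbutils.secrets.get call will only succeed on a Databricks cluster.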

Key Points

  • Use the parameters argument in Databricks jobs to override Hydra config variables (a job-parameters sketch follows after this list).

  • Ensure your script's main function uses @hydra.main, so CLI overrides work.

  • Call dbutils.secrets.get directly in the script to read secretsโ€”this works in .py files run by Databricks jobs.
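
If the value should be settable per run rather than hard-coded, Databricks job-level parameters can be forwarded to the task through dynamic value references; a sketch, with the hypothetical parameter name run_id:

json
{
  "parameters": [{ "name": "run_id", "default": "1234" }],
  "tasks": [{
    "task_key": "my_hydra_task",
    "spark_python_task": {
      "python_file": "dbfs:/path/to/main.py",
      "parameters": ["id={{job.parameters.run_id}}"]
    }
  }]
}

Overriding run_id when triggering the run then changes the id=... string Hydra receives.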
