3 weeks ago
Hello,
I am using Databricks Asset Bundles to create jobs for machine learning pipelines.
My problem: I am using spark_python_task tasks and defining params inside them. The job is created with those params, so when I want to run the same job with different params, I have to update the job definition first and then run it.
So, first question: is it good practice to do it like this?
Second question: I saw that a solution is to use job params and reference them in the tasks, and those params can be overridden at run time.
This approach works, but I don't like passing all the params at the job level rather than per task. Is there a better way of doing this?
3 weeks ago
So you want to pass task parameters instead of job parameters when running the bundle?
3 weeks ago
Yes, I'm looking for a method to do it. I found this thread discussing the topic: https://community.databricks.com/t5/data-engineering/quot-run-now-with-different-parameters-quot-dif...
2 weeks ago
Hi @Dali1,
Great questions -- parameterizing ML pipelines in DABs is something a lot of people wrestle with, so let me break down the options.
THE SHORT ANSWER
No, you should not have to update the job definition every time you want different parameters. Databricks supports runtime parameter overrides, and DABs has first-class support for this via the "databricks bundle run" command.
OPTION 1: JOB-LEVEL PARAMETERS WITH RUNTIME OVERRIDES (RECOMMENDED)
This is the approach you discovered, and it is actually the recommended pattern. You define parameters at the job level with sensible defaults, and then each task references them using dynamic value references ({{job.parameters.<name>}}). At runtime, you override only the ones you need.
In your bundle YAML:
resources:
  jobs:
    ml_training_job:
      name: ml-training-pipeline
      parameters:
        - name: "learning_rate"
          default: "0.01"
        - name: "epochs"
          default: "100"
        - name: "dataset_path"
          default: "/mnt/data/training"
      tasks:
        - task_key: train_model
          spark_python_task:
            python_file: ./train.py
            parameters:
              - "--learning_rate"
              - "{{job.parameters.learning_rate}}"
              - "--epochs"
              - "{{job.parameters.epochs}}"
              - "--dataset_path"
              - "{{job.parameters.dataset_path}}"
        - task_key: evaluate_model
          spark_python_task:
            python_file: ./evaluate.py
            parameters:
              - "--dataset_path"
              - "{{job.parameters.dataset_path}}"
          depends_on:
            - task_key: train_model
Run with different params (no job update needed):
databricks bundle run --params learning_rate=0.001,epochs=200 ml_training_job
In your Python script, access these as command-line arguments:
import argparse

# The entries in the task's "parameters" list arrive as sys.argv items,
# so the flag/value pairs above map directly onto argparse flags.
parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.01)
parser.add_argument("--epochs", type=int, default=100)
parser.add_argument("--dataset_path", type=str)
args = parser.parse_args()
Key advantage: You define the params once, reference them across multiple tasks, and override them at runtime without touching the job definition.
Docs:
- Parameterize jobs: https://docs.databricks.com/jobs/parameters
- Dynamic value references: https://docs.databricks.com/jobs/dynamic-value-references
- bundle run command: https://docs.databricks.com/dev-tools/cli/bundle-commands
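If you trigger runs programmatically instead of via the CLI, the same override maps onto the Jobs API run-now endpoint (POST /api/2.1/jobs/run-now), which accepts a job_parameters map; any parameter you omit keeps its default. A minimal sketch of building that request body (the job ID is hypothetical, and you would send the payload with whatever HTTP client or SDK you already use):

```python
import json

# Hypothetical job ID; in practice, look it up from the deployed bundle.
JOB_ID = 123456789


def build_run_now_payload(job_id, overrides):
    """Build the body for POST /api/2.1/jobs/run-now.

    `overrides` only needs the job parameters you want to change for
    this run; the Jobs service keeps defaults for everything else.
    Values are stringified because job parameters are strings.
    """
    return {
        "job_id": job_id,
        "job_parameters": {k: str(v) for k, v in overrides.items()},
    }


payload = build_run_now_payload(JOB_ID, {"learning_rate": 0.001, "epochs": 200})
print(json.dumps(payload))
```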
OPTION 2: TASK-LEVEL PARAMETERS (WITHOUT JOB PARAMETERS)
If you truly want per-task parameters and your tasks do not share parameters, you can skip job-level params entirely and define parameters directly on each spark_python_task. These are passed as command-line arguments to your script.
resources:
  jobs:
    ml_pipeline:
      name: ml-pipeline
      tasks:
        - task_key: train_model
          spark_python_task:
            python_file: ./train.py
            parameters:
              - "--learning_rate"
              - "0.01"
              - "--epochs"
              - "100"
        - task_key: evaluate_model
          spark_python_task:
            python_file: ./evaluate.py
            parameters:
              - "--threshold"
              - "0.85"
IMPORTANT: Job parameters and task parameters are MUTUALLY EXCLUSIVE. If your job has job-level "parameters" defined, you cannot use task-level parameter overrides at runtime, and vice versa. The CLI will throw an error if you mix them.
Docs:
- Add tasks to jobs in DABs: https://docs.databricks.com/dev-tools/bundles/job-task-types
- Access parameter values: https://docs.databricks.com/jobs/parameter-use
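Because these task parameters arrive as plain command-line arguments, the script side is the same argparse pattern as in Option 1, and you can sanity-check the parsing locally before deploying. A small sketch where the argv list simulates what the evaluate_model task above would pass:

```python
import argparse


def parse_task_args(argv):
    # Mirrors the flag/value pairs in the task's "parameters" list.
    parser = argparse.ArgumentParser()
    parser.add_argument("--threshold", type=float, default=0.85)
    return parser.parse_args(argv)


# Simulate the argv the evaluate_model task would receive.
args = parse_task_args(["--threshold", "0.9"])
print(args.threshold)  # 0.9
```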
OPTION 3: BUNDLE VARIABLES (DEPLOY-TIME, NOT RUN-TIME)
DABs also has custom variables (${var.<name>}), but these are resolved at DEPLOY time, not run time. They are useful for environment-specific config (dev vs. prod cluster IDs, catalog names, etc.), but they are NOT what you want for ML hyperparameters that change between runs.
variables:
  env:
    default: "dev"
  catalog:
    default: "ml_dev"

resources:
  jobs:
    ml_pipeline:
      name: ml-pipeline-${var.env}
      tasks:
        - task_key: train
          spark_python_task:
            python_file: ./train.py
            parameters:
              - "--catalog"
              - "${var.catalog}"
Override at deploy time:
databricks bundle deploy --var="env=prod,catalog=ml_prod"
These get baked into the job definition at deployment. Changing them requires redeployment.
Docs: https://docs.databricks.com/dev-tools/bundles/variables
MY RECOMMENDATION FOR YOUR ML PIPELINE
Use a hybrid approach:
1. Job-level parameters for anything that changes between runs (hyperparameters, dataset paths, experiment names). These can be overridden with "databricks bundle run --params" without redeploying.
2. Bundle variables for anything that changes between environments but stays constant across runs (cluster IDs, catalog/schema names, storage paths).
3. Not every parameter needs to be at the job level -- only the ones you want to vary at runtime. Parameters that are truly task-specific and never change can stay hardcoded in the task's parameters list.
The fact that job parameters are defined in one place and referenced via {{job.parameters.<name>}} actually makes them easier to manage than scattering parameters across individual tasks, especially when multiple tasks share the same values (like a dataset path used by both training and evaluation).
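To make the hybrid concrete, here is a minimal sketch combining both mechanisms in one bundle (the names and paths are illustrative, not from your setup):

```yaml
variables:
  catalog:
    default: "ml_dev"        # deploy-time: override with `bundle deploy --var`

resources:
  jobs:
    ml_training_job:
      name: ml-training-pipeline
      parameters:
        - name: "learning_rate"   # run-time: override with `bundle run --params`
          default: "0.01"
      tasks:
        - task_key: train_model
          spark_python_task:
            python_file: ./train.py
            parameters:
              - "--learning_rate"
              - "{{job.parameters.learning_rate}}"
              - "--catalog"
              - "${var.catalog}"
```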
I hope this helps! Let me know if you have questions about any of these approaches.
* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.
2 weeks ago
Yes, I would like to use option 2 with task-specific parameters, but is it possible to override them at runtime using the Databricks API?
The solution I found was to define job parameters that are referenced by the task params; that worked. If there is a better approach, I will take it.