Author: @shwetav1407
Tags: #workflows, #orchestration, #jobs
Welcome to the blog series exploring Databricks Workflows, a powerful product for orchestrating data processing, machine learning, and analytics pipelines on the Databricks Data Intelligence Platform. Here, we will dive into a key feature that brings flexibility and reusability to your pipelines: Workflow Parameters.
Introduction - Why Parameters Matter
When managing complex data workflows, efficiency and flexibility are your most valuable tools. Imagine being able to tailor every step of your data pipeline - every notebook, every transformation, every model - to perfectly fit your needs at runtime. With workflow parameters in Databricks, you can do exactly that.
Workflow parameters act as dynamic inputs that help guide execution across your notebooks and tasks. Rather than hardcoding values, you can adjust variables on the fly based on data, runtime conditions, or the output of upstream tasks — letting your workflows adapt intelligently to any situation. Whether you are running ETL pipelines, training machine learning models, or scheduling nightly jobs, parameters help you streamline operations and maximize efficiency.
There are four foundational concepts for parameterizing workflows: job parameters, task parameters, dynamic value references, and task values.
In this blog, we explore all four concepts through a real-world healthcare pipeline, showing you how to make your workflows smarter, more adaptive, and easier to maintain.
Benefits of Workflow Parameterization
Workflow parameters are central to building data pipelines that are efficient, maintainable, and adaptive: the same job can be rerun with different inputs, promoted across environments without code changes, and chained dynamically from task to task.
In essence, parameters transform static, rigid systems into dynamic, adaptable workflows that scale with confidence.
Types of Workflow Parameters
Job Parameters
Job parameters are key-value pairs defined at the job level. When a job runs, these parameters are automatically pushed down to all compatible tasks. They are ideal for controlling settings that apply across the entire job — things like environment names, processing dates, or file paths.
Why they are important:
Job parameters let you rerun jobs with different inputs without touching any code. They are especially powerful in CI/CD pipelines, where the same job definition runs in dev, staging, and prod by simply swapping a single parameter.
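As a minimal sketch of this pattern (the mapping and names below are illustrative, not taken from this blog's pipeline), a single env job parameter can steer every environment-specific setting in a run:

```python
# Hypothetical environment map: one job definition, steered entirely
# by the "env" job parameter. Schema and mode values are illustrative.
ENV_CONFIG = {
    "dev":     {"schema": "healthverity_dev",  "write_mode": "overwrite"},
    "staging": {"schema": "healthverity_stg",  "write_mode": "overwrite"},
    "prod":    {"schema": "healthverity_prod", "write_mode": "append"},
}

def resolve_config(env: str) -> dict:
    """Look up environment-specific settings for the given env parameter."""
    return ENV_CONFIG[env]

# On Databricks, the value would come from the job parameter:
# env = dbutils.widgets.get("env")
env = "dev"
print(resolve_config(env)["schema"])  # healthverity_dev
```

Swapping `env` from `dev` to `prod` in the job configuration redirects the whole run with zero code changes.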
Important: Job parameters take precedence over task parameters. If a job parameter and a task parameter share the same key, the job parameter value wins.
Use Cases:
Example - JSON job configuration:
"parameters": {
"env": "dev",
"raw_table": "sales_raw"
}
Retrieve in your notebook:
env = dbutils.widgets.get("env")
raw_table = dbutils.widgets.get("raw_table")
You can also configure job parameters directly in the Databricks UI: navigate to your job, open the Job details sidebar, and click Edit parameters. Use the { } button to browse and insert available dynamic value references.
Task Parameters
Task parameters are key-value pairs (or JSON arrays) defined at the individual task level. Unlike job parameters — which apply globally — task parameters let you customize the behavior of each task independently within a multi-task workflow.
Why they are important:
Task parameters are essential when different stages of a pipeline need to behave differently. You can control task-specific logic, toggle configurations, or enable/disable operations without affecting the rest of the pipeline. How task parameters are passed to the underlying asset depends on the task type — notebook tasks use dbutils.widgets, while Python script tasks receive them as command-line arguments.
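For a Python script task, the parameters arrive as command-line arguments, which you can parse with argparse. This is a hedged sketch: the flag names (--write_mode, --env) are hypothetical and depend entirely on how the task's parameters list is configured.

```python
import argparse

def parse_task_args(argv):
    """Parse task parameters delivered as command-line arguments.

    Assumes the task's parameters field is configured as, e.g.,
    ["--write_mode", "overwrite", "--env", "prod"]; the flag names
    are chosen by whoever defines the task, not by Databricks.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--write_mode", default="append")
    parser.add_argument("--env", default="dev")
    return parser.parse_args(argv)

args = parse_task_args(["--write_mode", "overwrite", "--env", "prod"])
print(args.write_mode)  # overwrite
```

On Databricks, the real arguments would arrive via sys.argv[1:] rather than a hardcoded list.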
Use Cases:
Example - JSON task configuration:
{
"task_key": "load",
"notebook_path": "/etl/load",
"parameters": {
"write_mode": "append"
}
}
You can retrieve task parameters inside a notebook using Databricks widgets like this:
write_mode = dbutils.widgets.get("write_mode")
Dynamic Value Reference
Dynamic Value References use a {{ }} double-curly-brace syntax to inject runtime information into task configurations. Rather than passing static values, these references resolve automatically at execution time — pulling in job metadata, trigger details, timestamps, or the output of upstream tasks.
Why they are important:
Dynamic value references make your workflows responsive to the conditions at runtime. They are invaluable for audit trails, conditional logic, chaining task outputs, and building backfill-capable pipelines.
Commonly Used References:
| Reference | Description |
| --- | --- |
| {{job.id}} | The unique identifier of the job |
| {{job.run_id}} | The unique identifier of the job run |
| {{job.name}} | The job name at the time of the run |
| {{job.start_time.iso_date}} | The run start date (UTC, ISO format) |
| {{task.name}} | The name of the current task |
| {{task.run_id}} | The unique identifier of the task run |
| {{tasks.<task_name>.result_state}} | Result state of an upstream task (success, failed, etc.) |
| {{tasks.<task_name>.error_code}} | Error code for a failed upstream task |
| {{tasks.<task_name>.values.<key>}} | A task value published by an upstream task |
| {{backfill.iso_date}} | The ISO date for a backfill job run |
| {{workspace.id}} | The unique identifier of the workspace |
Example — passing a dynamic reference as a task parameter:
"proc_date": "{{job.start_time.iso_date}}"
You can retrieve the resolved value inside the notebook like any other parameter:
proc_date = dbutils.widgets.get("proc_date")
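As a minimal audit-trail sketch (the parameter names run_id and run_date are hypothetical), suppose a task maps "run_id" to {{job.run_id}} and "run_date" to {{job.start_time.iso_date}}. A notebook could then stamp every written row with this metadata; the dbutils and PySpark calls are commented out because they only resolve on Databricks:

```python
# On Databricks, these would be resolved at run time:
# run_id = dbutils.widgets.get("run_id")      # from {{job.run_id}}
# run_date = dbutils.widgets.get("run_date")  # from {{job.start_time.iso_date}}
run_id, run_date = "123456789", "2025-06-22"  # stand-in values for illustration

# Attach the metadata as audit columns before writing, e.g. with PySpark:
# df = df.withColumn("ingested_by_run", F.lit(run_id)) \
#        .withColumn("ingested_on", F.lit(run_date))
audit_columns = {"ingested_by_run": run_id, "ingested_on": run_date}
print(audit_columns)
```

Because the references resolve at execution time, every row written by a run can later be traced back to the exact job run that produced it.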
Task Values
Task values are a mechanism for capturing values produced during task execution and making them available to downstream tasks. Unlike parameters — which are set before a run — task values are written during execution and read by subsequent tasks at runtime.
Why they are important:
Task values enable dynamic chaining between pipeline stages. Rather than duplicating logic or hardcoding table names, tasks can publish outputs that downstream tasks consume directly — keeping pipelines DRY (Don't Repeat Yourself) and robust across environments.
Set a task value (in the producing task):
dbutils.jobs.taskValues.set(key="bronze_path", value=bronze_table)
Reference in a downstream task configuration:
"bronze_path": "{{tasks.snapshot_claims.values.bronze_path}}"
Retrieve in the consuming notebook:
bronze_path = dbutils.jobs.taskValues.get(
taskKey="snapshot_claims",
key="bronze_path"
)
Tip: In addition to task values, SQL tasks can pass their query output to downstream tasks via {{tasks.<task_name>.output.rows}} or {{tasks.<task_name>.output.first_row.<column>}}. This is particularly useful for feeding dynamic lists into a For each task.
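As a hedged sketch of that pattern, a For each task's inputs field can reference a SQL task's output; the task keys, notebook path, and column name below are hypothetical:

```json
{
  "task_key": "process_each_table",
  "for_each_task": {
    "inputs": "{{tasks.list_tables.output.rows}}",
    "task": {
      "task_key": "process_each_table_iteration",
      "notebook_task": {
        "notebook_path": "/etl/process_table",
        "base_parameters": {
          "table_name": "{{input.table_name}}"
        }
      }
    }
  }
}
```

Each row returned by the upstream SQL task becomes one iteration, with its columns available through {{input.<column>}}.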
Use Case: HealthVerify Claims Risk ETL Pipeline
Every night, a national payor receives millions of raw claim lines. Business leaders want to know by the next morning which patients are trending toward high cost — so outreach nurses can intervene before the next expensive hospital visit. Auditors insist that every number published can be reproduced in court five years from now. And data scientists iterate on risk formulas weekly and cannot wait for DevOps tickets.
Those three requirements — speed, auditability, and agility — shape the pipeline we are about to build. We will use Databricks Workflow Parameters to pin each run to a processing date and environment (job parameters), pass table names between layers (task values and dynamic references), switch scoring formulas on demand (task parameters), and tune alert thresholds without touching code.
The pipeline consists of four tasks:
| Task | Layer | Key Parameter Type Used |
| --- | --- | --- |
| 01_snapshot_claims | Bronze | Job parameters |
| 02_enrich_claims | Silver | Task values (dynamic reference) |
| 03_score_risk | Gold | Task parameter (scoring_strategy) |
| 04_surface_alerts | Alerts | Task parameter + task value |

Note: Tasks are prefixed with a running sequence (01_ through 04_) to make ordering and references explicit.

We will meet all four requirements with this four-task Databricks Workflow. Each task highlights a different flavor of parameterization.
Step 1 - Capturing the Day's Raw Data (Bronze)
When the workflow kicks off, the first task (01_snapshot_claims) captures an immutable record of truth for the processing date. It reads the job parameter proc_date (for example, 2025-06-22) and the job parameter env to determine whether data should land in the dev or prod database schema.
Once the Bronze snapshot is written, the notebook immediately publishes the fully qualified table name as a task value:
# 01_snapshot_claims
from pyspark.sql import functions as F
proc_date = dbutils.widgets.get("proc_date")
env = dbutils.widgets.get("env")
bronze_table = f"sv_catalog.healthverity_{env}.bronze_claims_{proc_date.replace('-', '')}"
raw_df = spark.read.table("source_catalog.claims.raw_claims_feed") \
.filter(F.col("proc_date") == proc_date)
raw_df.write.format("delta").mode("overwrite").saveAsTable(bronze_table)
dbutils.jobs.taskValues.set(key="bronze_path", value=bronze_table)
Publishing bronze_path here eliminates two brittle alternatives: duplicating the Silver notebook per environment, or guessing the table name by parsing strings. Downstream tasks simply ask the workflow for the value.
Step 2 - Deduplicate, Clean and Enrich (Silver)
The second task (02_enrich_claims) picks up exactly where Bronze left off. Databricks injects the bronze_path task value via a dynamic reference ({{tasks.01_snapshot_claims.values.bronze_path}}) into the task's widget configuration, so the notebook always operates on the correct table regardless of environment.
# 02_enrich_claims
from pyspark.sql import functions as F
bronze_path = dbutils.jobs.taskValues.get(
taskKey="01_snapshot_claims",
key="bronze_path"
)
bronze_df = spark.table(bronze_path)
cleaned_df = (
bronze_df
.dropDuplicates(["claim_id", "hvid"]) # remove re-processing duplicates
.filter(F.col("diagnosis_code").isNotNull()) # drop rows missing mandatory fields
.withColumn("age", 2025 - F.col("patient_year_of_birth").cast("int"))
)
silver_table = "sv_catalog.healthverity.silver_claims_enriched"
cleaned_df.write.format("delta").mode("overwrite").saveAsTable(silver_table)
dbutils.jobs.taskValues.set(key="silver_path", value=silver_table)
The notebook performs three operations: deduplication (removing rows sharing claim_id and hvid), validation (discarding rows with missing diagnosis_code), and enrichment (converting patient_year_of_birth into a ready-to-use age column). The cleaned data is written to silver_claims_enriched, and silver_path is published for the next stage.
Step 3 - Score Patient Risk (Gold)
The Gold notebook (03_score_risk) contains the analytical core of the pipeline. It pulls silver_path dynamically via a task value, and reads a task parameter scoring_strategy that controls which risk formula to apply:
# 03_score_risk
from pyspark.sql import functions as F
scoring_strategy = dbutils.widgets.get("scoring_strategy")
silver_df = spark.table(
dbutils.jobs.taskValues.get(taskKey="02_enrich_claims", key="silver_path")
)
chronic_conditions = ["F1020", "F32A", "E119", "F410"]
agg = (
silver_df
.withColumn("chronic", F.col("diagnosis_code").isin(chronic_conditions))
.groupBy("hvid")
.agg(
F.first("age").alias("age"),
F.sum(F.col("chronic").cast("int")).alias("chronic_dx_count")
)
)
if scoring_strategy == "v1":
agg = agg.withColumn("risk_score", F.col("chronic_dx_count") * 0.3 + F.col("age") * 0.01)
elif scoring_strategy == "v2":
agg = agg.withColumn("risk_score", F.col("chronic_dx_count") * 0.5 + F.col("age") * 0.02)
else:
agg = agg.withColumn("risk_score", F.col("chronic_dx_count") * 0.4)
agg = agg.withColumn(
"risk_level",
F.when(F.col("risk_score") > 0.8, "Critical")
.when(F.col("risk_score") > 0.5, "High")
.otherwise("Medium")
)
gold_path = "sv_catalog.healthverity.gold_patient_risk_scores"
agg.write.format("delta").mode("overwrite").saveAsTable(gold_path)
dbutils.jobs.taskValues.set(key="kpi_path", value=gold_path)
Using the selected strategy, the notebook computes a risk_score and derives a risk_level (Critical, High, Medium). The results are then written to gold_patient_risk_scores.
By parameterizing the strategy, analysts can run side-by-side what-if comparisons directly from the UI — no notebook cloning, no redeployment.
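The strategy dispatch can also be exercised outside Spark. Here is the same branching logic expressed as a plain Python function (a sketch mirroring the notebook's formulas, not part of the pipeline code itself):

```python
def risk_score(chronic_dx_count: int, age: int, strategy: str) -> float:
    """Mirror of the notebook's scoring_strategy branches."""
    if strategy == "v1":
        return chronic_dx_count * 0.3 + age * 0.01
    elif strategy == "v2":
        return chronic_dx_count * 0.5 + age * 0.02
    return chronic_dx_count * 0.4  # fallback strategy

# Side-by-side what-if comparison for one patient:
print(risk_score(2, 60, "v1"))  # 1.2
print(risk_score(2, 60, "v2"))  # 2.2
```

Running both strategies on the same inputs shows exactly how much a formula change shifts the scores before anyone flips the task parameter in production.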
Step 4 - Surface Critical Alerts
The final task (04_surface_alerts) identifies patients whose risk_score exceeds a configurable threshold and appends them to the operations alert feed.
# 04_surface_alerts
from pyspark.sql import functions as F
threshold = float(dbutils.widgets.get("risk_threshold")) # task parameter, default: 0.85
kpi_path = dbutils.jobs.taskValues.get(taskKey="03_score_risk", key="kpi_path")
kpis = spark.table(kpi_path)
alerts = kpis.filter(F.col("risk_score") >= threshold)
alert_count = alerts.count()
alert_path = "sv_catalog.healthverity.ops_risk_alerts"
alerts.write.format("delta").mode("append").saveAsTable(alert_path)
print(f"Surge monitor finished. {alert_count} high-risk patients flagged.")
Because kpi_path is injected dynamically, this task always reads the exact Gold snapshot produced earlier in the same run — never a stale table from a prior run.
Tuning the alert system requires no engineering work: if the medical team decides the threshold should move from 0.85 to 0.9, a non-technical user can update the task parameter in the UI and click Run now. For production environments, such changes would typically flow through a CI/CD pipeline — but the point stands: the logic is fully decoupled from its configuration.
Other Scenarios where Parameters Shine
The HealthVerify pipeline illustrates healthcare data workflows, but Databricks Workflow Parameters apply equally well across industries and use cases:
Parameters turn one codebase into many bespoke pipelines — all configurable from the UI or API, with zero code changes.
Programmatic Ways to Work with Parameters
Mastering the UI is great for ad-hoc work, but production platforms demand code-driven automation. Here are three approaches to parameterize and trigger workflows entirely from code.
Jobs REST API
The Jobs REST API is the Swiss-army knife for CI/CD, Airflow, or any external orchestrator. It is language-agnostic and easy to invoke from tools like Terraform, Bash, or GitHub Actions.
curl -X POST https://<workspace>/api/2.1/jobs/run-now \
-H "Authorization: Bearer $TOKEN" \
-d '{
"job_id": 1234,
"job_parameters": {
"proc_date": "2025-06-21",
"env": "prod",
"scoring_strategy": "v2"
}
}'
See the Jobs API Reference for the full specification.
Databricks CLI
The CLI provides a human-friendly wrapper around the API — ideal for local development or shell-based pipelines. Store your job.json in Git and reference environment-specific YAML or Jinja templates for clean, reproducible deployments.
# Create or overwrite a job from a JSON definition
databricks jobs create --json-file deployment/job.json
# Trigger a run with parameter overrides
databricks jobs run-now --job-id 1234 \
--notebook-params '{"proc_date":"2025-06-30","env":"dev"}'
Recommended reading:
Databricks SDKs
Databricks SDKs provide typed, idiomatic access to the Jobs API — handling authentication, pagination, and retries automatically. They are the best choice for ML pipelines, internal tooling, or any application that manages workflows programmatically.
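A minimal sketch with the Python SDK (databricks-sdk), reusing the job ID and parameter names from the earlier examples; the run_now call is commented out because it requires a live, authenticated workspace:

```python
def build_run_params(env: str, proc_date: str, strategy: str) -> dict:
    """Assemble the job parameters for one pipeline run."""
    return {"env": env, "proc_date": proc_date, "scoring_strategy": strategy}

params = build_run_params("prod", "2025-06-21", "v2")

# With an authenticated workspace (credentials via environment
# variables or a configuration profile):
# from databricks.sdk import WorkspaceClient
# w = WorkspaceClient()
# w.jobs.run_now(job_id=1234, job_parameters=params)
print(params)
```

Keeping the parameter assembly in a small pure function makes the triggering logic easy to unit-test without touching a workspace.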
Conclusion
Workflow parameters are not an optional feature — they are the foundation of any production-grade pipeline on Databricks. They separate configuration from logic, enable code reuse across environments, and make dynamic task chaining both safe and scalable.
In this blog, we walked through all four parameterization concepts — job parameters, task parameters, dynamic value references, and task values — and showed how they power a real-world healthcare pipeline that moves fast, stays audit-compliant, and lets teams experiment safely. From adjusting thresholds in the UI to chaining outputs across tasks, parameters make your workflows smarter and dramatically easier to operate.
As you scale your platform, treat parameters as your pipeline's API contract. Parameterize everything. Deploy with confidence.
Clone the sample, try dynamic value references, and future-proof your DAGs.
Stay tuned for our next blog post, where we'll explore more features and capabilities of Databricks Lakeflow Jobs.