Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
Databricks Employee

Author: @shwetav1407 

Tags: #workflows, #orchestration, #jobs

Welcome to the blog series exploring Databricks Workflows, a powerful product for orchestrating data processing, machine learning, and analytics pipelines on the Databricks Data Intelligence Platform. Here, we will dive into a key feature that brings flexibility and reusability to your pipelines: Workflow Parameters.

Introduction - Why Parameters Matter

When managing complex data workflows, efficiency and flexibility are your most valuable tools. Imagine being able to tailor every step of your data pipeline - every notebook, every transformation, every model - to perfectly fit your needs at runtime. With workflow parameters in Databricks, you can do exactly that.

Workflow parameters act as dynamic inputs that help guide execution across your notebooks and tasks. Rather than hardcoding values, you can adjust variables on the fly based on data, runtime conditions, or the output of upstream tasks — letting your workflows adapt intelligently to any situation. Whether you are running ETL pipelines, training machine learning models, or scheduling nightly jobs, parameters help you streamline operations and maximize efficiency.

There are four foundational concepts for parameterizing workflows: 

  • Job Parameters — key-value pairs defined at the job level and pushed down to all tasks
  • Task Parameters — key-value pairs defined at the individual task level
  • Dynamic Value References — a syntax for referencing job metadata, conditions, and parameters when configuring tasks
  • Task Values — a mechanism for capturing and passing values generated during task execution to downstream tasks

In this blog, we explore all four concepts through a real-world healthcare pipeline, showing you how to make your workflows smarter, more adaptive, and easier to maintain.

Benefits of Workflow Parameterization

Workflow parameters are central to building data pipelines that are efficient, maintainable, and adaptive. Here are the primary benefits:

  • Ease of Maintenance : Parameters give you precise control over how tasks execute by letting you pass dynamic values at runtime. This means you can tailor execution to fit specific environments, user inputs, or data sources — without ever modifying the underlying code. For example, passing a file path as a parameter allows the same job to process data from different locations on every run.
  • Efficiency : One of the most impactful benefits of parameterization is the ability to automate repetitive processes. By defining parameters for key values, your workflows run with minimal manual intervention, reducing human error and freeing your team for higher-value work. Instead of editing configurations before every run, parameters let the workflow adapt automatically.
  • Flexibility : Parameters make your system resilient to change. Whether you are working with different datasets, diverse user inputs, or multiple cloud environments, parameters allow your workflow to adjust dynamically. This is especially valuable when data inputs and processing conditions change frequently — your core logic stays intact while the configuration adapts.
  • Reusability : Parameterized workflows are far easier to scale and reuse. The same pipeline can serve different customers, environments, or datasets simply by changing parameter values — no need to build and maintain multiple separate jobs.

In essence, parameters transform static, rigid systems into dynamic, adaptable workflows that scale with confidence.

Types of Workflow Parameters

Job Parameters

Job parameters are key-value pairs defined at the job level. When a job runs, these parameters are automatically pushed down to all compatible tasks. They are ideal for controlling settings that apply across the entire job — things like environment names, processing dates, or file paths.

Why they are important : 

Job parameters let you rerun jobs with different inputs without touching any code. They are especially powerful in CI/CD pipelines, where the same job definition runs in dev, staging, and prod by simply swapping a single parameter.
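To make this concrete, here is an illustrative sketch (the job ID 1234 and the parameter names are assumptions for this example) showing that, between environments, only the parameter payload changes while the job definition stays fixed:

```python
import json

def run_now_payload(env: str, proc_date: str, job_id: int = 1234) -> str:
    # Only the parameter values vary per environment; the job
    # definition itself is identical in dev, staging, and prod.
    return json.dumps({
        "job_id": job_id,
        "job_parameters": {"env": env, "proc_date": proc_date},
    })

print(run_now_payload("prod", "2025-06-22"))
```

Swapping "prod" for "dev" or "staging" is the only change needed to promote the same job through environments.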

Important: Job parameters take precedence over task parameters. If a job parameter and a task parameter share the same key, the job parameter value wins.

Use Cases : 

  • Dynamic File paths : Pass a source path as a job parameter so the job reads from different locations on each run.
  • Execution Flags : Use boolean-style parameters to toggle data cleaning steps, enable verbose logging, or switch between load modes (full vs. incremental).
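One practical note on execution flags: widget values arrive as strings, so the string "false" is truthy if used directly in an if-statement. A small helper (the name parse_flag is our own, purely illustrative) makes the intent explicit:

```python
def parse_flag(value: str, default: bool = False) -> bool:
    # dbutils.widgets.get returns strings, never booleans, so
    # boolean-style parameters need explicit parsing.
    if value is None or value.strip() == "":
        return default
    return value.strip().lower() in ("true", "1", "yes")

print(parse_flag("True"), parse_flag("false"))  # True False
```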

Example - JSON job configuration:

"parameters": [
	{"name": "env", "default": "dev"},
	{"name": "raw_table", "default": "sales_raw"}
]



Retrieve in your notebook:

env = dbutils.widgets.get("env")
raw_table = dbutils.widgets.get("raw_table")

You can also configure job parameters directly in the Databricks UI: navigate to your job, open the Job details sidebar, and click Edit parameters. Use the { } button to browse and insert available dynamic value references.

Task Parameters

Task parameters are key-value pairs (or JSON arrays) defined at the individual task level. Unlike job parameters — which apply globally — task parameters let you customize the behavior of each task independently within a multi-task workflow.

Why They're Important:
Task parameters are essential when different stages of a pipeline need to behave differently. You can control task-specific logic, toggle configurations, or enable/disable operations without affecting the rest of the pipeline. How task parameters are passed to the underlying asset depends on the task type — notebook tasks use dbutils.widgets, while Python script tasks receive them as command-line arguments.
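For instance, a Python script task might read its parameters with argparse; the call below simulates how the task would be invoked with parameters ["--write_mode", "overwrite", "--env", "prod"] (the parameter names here are illustrative):

```python
import argparse

def parse_args(argv):
    # A Python script task receives its task parameters as
    # command-line arguments rather than through dbutils.widgets.
    parser = argparse.ArgumentParser()
    parser.add_argument("--write_mode", default="append")
    parser.add_argument("--env", default="dev")
    return parser.parse_args(argv)

args = parse_args(["--write_mode", "overwrite", "--env", "prod"])
print(args.write_mode, args.env)  # overwrite prod
```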

Use Cases:

  • Metadata sharing between tasks : Pass small configuration values — flags, execution modes, or status indicators — from one task to the next to coordinate behavior across pipeline stages.
  • Parameter-driven task execution : Use task parameters to conditionally control which steps run, based on factors like whether a dataset exists or which business rules apply.

Example - JSON task configuration :

{
	"task_key": "load",
	"notebook_task": {
		"notebook_path": "/etl/load",
		"base_parameters": {
			"write_mode": "append"
		}
	}
}

You can retrieve task parameters inside a notebook using Databricks widgets like this:

write_mode = dbutils.widgets.get("write_mode")

Dynamic Value References

Dynamic Value References use a {{ }} double-curly-brace syntax to inject runtime information into task configurations. Rather than passing static values, these references resolve automatically at execution time — pulling in job metadata, trigger details, timestamps, or the output of upstream tasks.

Why they are important :

Dynamic value references make your workflows responsive to the conditions at runtime. They are invaluable for audit trails, conditional logic, chaining task outputs, and building backfill-capable pipelines.

Commonly Used References:

  • {{job.id}} : The unique identifier of the job
  • {{job.run_id}} : The unique identifier of the job run
  • {{job.name}} : The job name at the time of the run
  • {{job.start_time.iso_date}} : The run start date (UTC, ISO format)
  • {{task.name}} : The name of the current task
  • {{task.run_id}} : The unique identifier of the task run
  • {{tasks.<task_name>.result_state}} : Result state of an upstream task (success, failed, etc.)
  • {{tasks.<task_name>.error_code}} : Error code for a failed upstream task
  • {{tasks.<task_name>.values.<key>}} : A task value published by an upstream task
  • {{backfill.iso_date}} : The ISO date for a backfill job run
  • {{workspace.id}} : The unique identifier of the workspace
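To build intuition for how these references behave, here is a toy resolver (purely illustrative, not Databricks' implementation) that substitutes {{ }} references from a dictionary of runtime values:

```python
import re

def resolve(template: str, context: dict) -> str:
    # Substitute each {{ reference }} with its runtime value; Databricks
    # performs an equivalent substitution before the task receives its
    # parameters.
    def lookup(match):
        return str(context[match.group(1).strip()])
    return re.sub(r"\{\{(.+?)\}\}", lookup, template)

ctx = {"job.start_time.iso_date": "2025-06-22", "job.run_id": "98765"}
print(resolve("proc_date={{job.start_time.iso_date}}", ctx))  # proc_date=2025-06-22
```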

Example — passing a dynamic reference as a task parameter:

"proc_date": "{{job.start_time.iso_date}}"

The notebook then reads the resolved value like any other task parameter:

proc_date = dbutils.widgets.get("proc_date")

Task Values

Task values are a mechanism for capturing values produced during task execution and making them available to downstream tasks. Unlike parameters — which are set before a run — task values are written during execution and read by subsequent tasks at runtime.

Why they are important : 

Task values enable dynamic chaining between pipeline stages. Rather than duplicating logic or hardcoding table names, tasks can publish outputs that downstream tasks consume directly — keeping pipelines DRY (Don't Repeat Yourself) and robust across environments.

Set a task value (in the producing task):

dbutils.jobs.taskValues.set(key="bronze_path", value=bronze_table)

Reference in a downstream task configuration:

"bronze_path": "{{tasks.snapshot_claims.values.bronze_path}}"

Retrieve in the consuming notebook:

bronze_path = dbutils.jobs.taskValues.get( 
    taskKey="snapshot_claims", 
    key="bronze_path" 
)
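Conceptually, task values behave like a per-run key-value store namespaced by the producing task. A toy model of the set/get semantics (illustrative only, not the real implementation):

```python
# Toy model: values are keyed by (producing task, key), which is why
# the consumer must name the upstream task in taskValues.get.
store = {}

def set_value(task_key: str, key: str, value) -> None:
    store[(task_key, key)] = value

def get_value(task_key: str, key: str, default=None):
    return store.get((task_key, key), default)

set_value("snapshot_claims", "bronze_path", "catalog.schema.bronze_claims")
print(get_value("snapshot_claims", "bronze_path"))
```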

 

Tip: In addition to task values, SQL tasks can pass their query output to downstream tasks via {{tasks.<task_name>.output.rows}} or {{tasks.<task_name>.output.first_row.<column>}}. This is particularly useful for feeding dynamic lists into For each task.

Use Case: HealthVerify Claims Risk ETL Pipeline

Every night, a national payor receives millions of raw claim lines. Business leaders want to know by the next morning which patients are trending toward high cost — so outreach nurses can intervene before the next expensive hospital visit. Auditors insist that every number published can be reproduced in court five years from now. And data scientists iterate on risk formulas weekly and cannot wait for DevOps tickets.

Those three requirements — speed, auditability, and agility — shape the pipeline we are about to build. We will use Databricks Workflow Parameters to:

  • Run the same code across dev, staging, and prod without any changes
  • Pass dynamically generated table names between tasks
  • Let analysts test new risk strategies with a single dropdown change in the UI

The pipeline consists of four tasks:

  • 01_snapshot_claims (Bronze) : Job parameters
  • 02_enrich_claims (Silver) : Task values (dynamic reference)
  • 03_score_risk (Gold) : Task parameter (scoring_strategy)
  • 04_surface_alerts (Alerts) : Task parameter + task value

Note: Tasks are prefixed with a running sequence (01_ through 04_) to make ordering and references explicit.

This four-task workflow meets all three requirements: speed, auditability, and agility. Each task highlights a different flavor of parameterization.



Step 1 - Capturing the Day's Raw Data (Bronze)

When the workflow kicks off, the first task (01_snapshot_claims) captures an immutable record of truth for the processing date. It reads the job parameter proc_date (for example, 2025-06-22) and the job parameter env to determine whether data should land in the dev or prod database schema.


Once the Bronze snapshot is written, the notebook immediately publishes the fully qualified table name as a task value:

# 01_snapshot_claims
from pyspark.sql import functions as F

proc_date = dbutils.widgets.get("proc_date")
env = dbutils.widgets.get("env")

bronze_table = f"sv_catalog.healthverity_{env}.bronze_claims_{proc_date.replace('-', '')}"

raw_df = spark.read.table("source_catalog.claims.raw_claims_feed") \
    .filter(F.col("proc_date") == proc_date)

raw_df.write.format("delta").mode("overwrite").saveAsTable(bronze_table)

dbutils.jobs.taskValues.set(key="bronze_path", value=bronze_table)

Publishing bronze_path here eliminates two brittle alternatives: duplicating the Silver notebook per environment, or guessing the table name by parsing strings. Downstream tasks simply ask the workflow for the value.

Step 2 - Deduplicate, Clean and Enrich (Silver)


The second task (02_enrich_claims) picks up exactly where Bronze left off. Databricks injects the bronze_path task value via a dynamic reference ({{tasks.01_snapshot_claims.values.bronze_path}}) into the task's widget configuration, so the notebook always operates on the correct table regardless of environment.


# 02_enrich_claims
from pyspark.sql import functions as F

bronze_path = dbutils.jobs.taskValues.get(
    taskKey="01_snapshot_claims",
    key="bronze_path"
)

bronze_df = spark.table(bronze_path)

cleaned_df = (
    bronze_df
    .dropDuplicates(["claim_id", "hvid"])           # remove re-processing duplicates
    .filter(F.col("diagnosis_code").isNotNull())     # drop rows missing mandatory fields
    .withColumn("age", 2025 - F.col("patient_year_of_birth").cast("int"))
)

silver_table = "sv_catalog.healthverity.silver_claims_enriched"
cleaned_df.write.format("delta").mode("overwrite").saveAsTable(silver_table)

dbutils.jobs.taskValues.set(key="silver_path", value=silver_table)

The notebook performs three operations: deduplication (removing rows sharing claim_id and hvid), validation (discarding rows with missing diagnosis_code), and enrichment (converting patient_year_of_birth into a ready-to-use age column). The cleaned data is written to silver_claims_enriched, and silver_path is published for the next stage.

Step 3 - Score Patient Risk (Gold)


The Gold notebook (03_score_risk) contains the analytical core of the pipeline. It pulls silver_path dynamically via a task value, and reads a task parameter scoring_strategy that controls which risk formula to apply:

  • v1 weights each chronic diagnosis at 30% and patient age at 1%.
  • v2 increases chronic weight to 50% and age to 2% (an actuarial team hypothesis).
  • Future versions — incorporating cost trend or medication adherence — require only a new branch in the notebook; no Workflow edits needed.


# 03_score_risk
from pyspark.sql import functions as F

scoring_strategy = dbutils.widgets.get("scoring_strategy")

silver_df = spark.table(
    dbutils.jobs.taskValues.get(taskKey="02_enrich_claims", key="silver_path")
)

chronic_conditions = ["F1020", "F32A", "E119", "F410"]
agg = (
    silver_df
    .withColumn("chronic", F.col("diagnosis_code").isin(chronic_conditions))
    .groupBy("hvid")
    .agg(
        F.first("age").alias("age"),
        F.sum(F.col("chronic").cast("int")).alias("chronic_dx_count")
    )
)

if scoring_strategy == "v1":
    agg = agg.withColumn("risk_score", F.col("chronic_dx_count") * 0.3 + F.col("age") * 0.01)
elif scoring_strategy == "v2":
    agg = agg.withColumn("risk_score", F.col("chronic_dx_count") * 0.5 + F.col("age") * 0.02)
else:
    agg = agg.withColumn("risk_score", F.col("chronic_dx_count") * 0.4)

agg = agg.withColumn(
    "risk_level",
    F.when(F.col("risk_score") > 0.8, "Critical")
     .when(F.col("risk_score") > 0.5, "High")
     .otherwise("Medium")
)

gold_path = "sv_catalog.healthverity.gold_patient_risk_scores"
agg.write.format("delta").mode("overwrite").saveAsTable(gold_path)

dbutils.jobs.taskValues.set(key="kpi_path", value=gold_path)

Using the selected strategy, the notebook computes a numeric risk_score and derives a risk_level (Critical, High, or Medium). The results are then written to gold_patient_risk_scores.


By parameterizing the strategy, analysts can run side-by-side what-if comparisons directly from the UI — no notebook cloning, no redeployment.
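The strategy branching reduces to a simple per-patient formula. A plain-Python sketch (mirroring the notebook's weights, for illustration) shows how the same patient scores under each strategy:

```python
def risk_score(chronic_dx_count: int, age: int, strategy: str) -> float:
    # Same weights as the notebook branches above.
    if strategy == "v1":
        return chronic_dx_count * 0.3 + age * 0.01
    if strategy == "v2":
        return chronic_dx_count * 0.5 + age * 0.02
    return chronic_dx_count * 0.4  # fallback for unknown strategies

def risk_level(score: float) -> str:
    if score > 0.8:
        return "Critical"
    if score > 0.5:
        return "High"
    return "Medium"

# One chronic diagnosis, age 30: the strategy choice changes the outcome.
print(risk_level(risk_score(1, 30, "v1")), risk_level(risk_score(1, 30, "v2")))
```

Under v1 this patient lands at High; under v2 the heavier chronic weight pushes the same patient to Critical, which is exactly the side-by-side comparison analysts run from the UI.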

Step 4 - Surface Critical Alerts


The final task (04_surface_alerts) identifies patients whose risk_score exceeds a configurable threshold and appends them to the operations alert feed.


# 04_surface_alerts
from pyspark.sql import functions as F

threshold = float(dbutils.widgets.get("risk_threshold"))  # task parameter, default: 0.85
kpi_path = dbutils.jobs.taskValues.get(taskKey="03_score_risk", key="kpi_path")

kpis = spark.table(kpi_path)
alerts = kpis.filter(F.col("risk_score") >= threshold)
alert_count = alerts.count()

alert_path = "sv_catalog.healthverity.ops_risk_alerts"
alerts.write.format("delta").mode("append").saveAsTable(alert_path)

print(f"Surge monitor finished. {alert_count} high-risk patients flagged.")

Because kpi_path is injected dynamically, this task always reads the exact Gold snapshot produced earlier in the same run — never a stale table from a prior run.


Tuning the alert system requires no engineering work: if the medical team decides the threshold should move from 0.85 to 0.9, a non-technical user can update the task parameter in the UI and click Run now. For production environments, such changes would typically flow through a CI/CD pipeline — but the point stands: the logic is fully decoupled from its configuration.

Other Scenarios where Parameters Shine

The HealthVerify pipeline illustrates healthcare data workflows, but Databricks Workflow Parameters apply equally well across industries and use cases:

  • Multi-region Data Lakes: Use a region_code job parameter to route ingestion or reporting logic to the correct S3 bucket or Unity Catalog volume per geography (e.g., us_east, eu_west). One job definition covers every region.
  • Model Training Pipelines: Use task parameters such as model_version, feature_set, or train_ratio to toggle configurations without editing notebooks. Chain task values from feature engineering tasks into model training or evaluation steps for seamless experiment tracking.
  • Backfills and Retrospective Analysis: A start_date and end_date job parameter pair lets you reuse entire workflows for historical backfills — critical in regulated industries and anomaly detection scenarios. The {{backfill.iso_date}} dynamic reference provides native support for this pattern.
  • Multi-Tenant ETL: A partner_id parameter can control table names, filters, and routing logic — letting one set of notebooks serve dozens of tenants through branching logic alone.
  • Conditional Pipeline Logic: Combine {{tasks.<task_name>.result_state}} dynamic references with If/else tasks to build pipelines that fork based on upstream success or failure, without additional orchestration tooling.
  • Iterative Processing: Use {{tasks.<task_name>.output.rows}} to feed a SQL task's output into a For each task, processing each row as a separate parameterized run — ideal for fan-out patterns like per-customer reporting.

Parameters turn one codebase into many bespoke pipelines — all configurable from the UI or API, with zero code changes.

Programmatic Ways to Work with Parameters

Mastering the UI is great for ad-hoc work, but production platforms demand code-driven automation. Here are three approaches to parameterize and trigger workflows entirely from code.

Databricks Jobs API (REST)

The Jobs REST API is the Swiss-army knife for CI/CD, Airflow, or any external orchestrator. It is language-agnostic and easy to invoke from tools like Terraform, Bash, or GitHub Actions.

curl -X POST https://<workspace>/api/2.1/jobs/run-now \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
        "job_id": 1234,
        "job_parameters": {
            "proc_date": "2025-06-21",
            "env": "prod",
            "scoring_strategy": "v2"
        }
      }'

See the Jobs API Reference for the full specification.

Databricks CLI

The CLI provides a human-friendly wrapper around the API — ideal for local development or shell-based pipelines. Store your job.json in Git and reference environment-specific YAML or Jinja templates for clean, reproducible deployments.

# Create or overwrite a job from a JSON definition
databricks jobs create --json-file deployment/job.json

# Trigger a run with parameter overrides
databricks jobs run-now --job-id 1234 \
  --notebook-params '{"proc_date":"2025-06-30","env":"dev"}'


Databricks SDKs

Databricks SDKs provide typed, idiomatic access to the Jobs API — handling authentication, pagination, and retries automatically. They are the best choice for ML pipelines, internal tooling, or any application that manages workflows programmatically.

Conclusion

Workflow parameters are not an optional feature — they are the foundation of any production-grade pipeline on Databricks. They separate configuration from logic, enable code reuse across environments, and make dynamic task chaining both safe and scalable.

In this blog, we walked through all four parameterization concepts — job parameters, task parameters, dynamic value references, and task values — and showed how they power a real-world healthcare pipeline that moves fast, stays audit-compliant, and lets teams experiment safely. From adjusting thresholds in the UI to chaining outputs across tasks, parameters make your workflows smarter and dramatically easier to operate.

As you scale your platform, treat parameters as your pipeline's API contract. Parameterize everything. Deploy with confidence.

Clone the sample, try dynamic value references, and future-proof your DAGs.

Stay tuned for our next blog post, where we'll explore more features and capabilities of Databricks Lakeflow Jobs.