Parallel Model Training & Data Pipelines on Databricks (ForEach Tasks + Asset Bundles + Pydantic)

sandy311 — Thu, 28 Aug 2025 15:32:30 GMT

As companies double down on machine learning (ML), one thing is obvious: a single model can’t solve every problem. Different datasets, different timelines, and different requirements make managing multiple models pretty tricky. And if you’ve ever worked with traditional pipelines, you know the pain — they’re rigid, messy to maintain, and usually need code changes every time business logic shifts.

To make life easier, we built a config-driven, parallel setup using:

Pydantic to keep our configs clean and validated.
Databricks ForEach tasks to run things in parallel and save a ton of runtime.
Databricks Asset Bundles (DAB) so deployments are smooth and automated.
Databricks Task Values to pass configs around without extra hacks.

The best part? It’s now super easy to scale. Onboarding a new model or dataset is as simple as dropping in a config, with no changes to the codebase.

🚨 The Challenge

We needed to:

Train models for different datasets with their own feature/label configurations.
Ensure preprocessing and training stayed decoupled but linked.
Scale without duplicating code.
Pass configurations cleanly between Databricks tasks.

In short, what we were really after was:

Config-driven training (no hardcoding every little thing).
Parallel execution to actually bring runtimes down.
Reusable preprocessing so each dataset could run with its own filters and date ranges.
Automated deployment that kicks in only if validation metrics look good.

Basic Workflow

Solution

Configuration-Driven Setup

We maintain a centralized YAML configuration that describes each dataset-specific model and each specific run.

# Config for training
training_variants:
  # Variant 1
  - train_code: V1
    save_data: true
    run_task: true
    target: target_cil
    date_column: date_col
    validation_months: 6
    training_months: 36
    model: random forest
    model_type: sklearn
    automated_deployment: true
    hyperparameters:
      n_estimators: 100
      max_depth: 10
    thresholds:
      precision_score:
        threshold: 0.70
        greater_is_better: true
      recall_score:
        threshold: 0.60
        greater_is_better: true

  - train_code: V2
    # ... and so on

Each model or dataset basically carries its own config, which defines things like:

Features/labels
Training/validation windows
Model type + hyperparameters
Deployment settings
Performance thresholds
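To make the thresholds block concrete, here is a minimal sketch of how a deployment gate could read it. The function names (passes_thresholds, should_deploy) and the sample metric values are made up for illustration, and treating a missing metric as a failure is an assumption, not something the pipeline above is known to do.

```python
from typing import Dict, Mapping


def passes_thresholds(
    metrics: Mapping[str, float],
    thresholds: Mapping[str, Mapping[str, object]],
) -> Dict[str, bool]:
    """Return a per-metric pass/fail map for the configured thresholds."""
    results: Dict[str, bool] = {}
    for name, rule in thresholds.items():
        if name not in metrics:
            # Assumption: a missing metric is a failure, not a silent pass.
            results[name] = False
            continue
        value = metrics[name]
        limit = float(rule["threshold"])
        if rule.get("greater_is_better", True):
            results[name] = value >= limit
        else:
            results[name] = value <= limit
    return results


def should_deploy(variant: Mapping[str, object], metrics: Mapping[str, float]) -> bool:
    """Deploy only when automated_deployment is on and every threshold passes."""
    if not variant.get("automated_deployment", False):
        return False
    checks = passes_thresholds(metrics, variant.get("thresholds", {}))
    return all(checks.values())


# Mirrors the V1 entry from the YAML; the metric values are invented.
variant = {
    "train_code": "V1",
    "automated_deployment": True,
    "thresholds": {
        "precision_score": {"threshold": 0.70, "greater_is_better": True},
        "recall_score": {"threshold": 0.60, "greater_is_better": True},
    },
}
validation_metrics = {"precision_score": 0.74, "recall_score": 0.58}
print(should_deploy(variant, validation_metrics))
```

With these invented numbers the recall check fails, so the gate blocks deployment even though precision clears its bar. The greater_is_better flag is what would let the same gate handle error-style metrics where lower is better.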

Because of this setup, the whole system feels:

Declarative → no hidden, hardcoded logic buried in code
Extendable → want a new variant? just drop in a new config entry
Safe → you can toggle runs easily with a run_task flag
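As one illustration of a config entry driving behaviour, the training_months and validation_months fields could be turned into date ranges roughly like this. It is only a sketch: the helper names are hypothetical, and it assumes the validation window is the most recent block of whole months with the training window sitting directly before it, which may not match how the real pipeline slices data.

```python
from datetime import date
from typing import Dict, Tuple


def shift_months(d: date, months: int) -> date:
    """Move a first-of-month date by a number of months (negative goes back)."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def split_windows(
    as_of: date, training_months: int, validation_months: int
) -> Dict[str, Tuple[date, date]]:
    """Return half-open [start, end) windows, validation being the most recent."""
    end = date(as_of.year, as_of.month, 1)  # snap to the start of the current month
    val_start = shift_months(end, -validation_months)
    train_start = shift_months(val_start, -training_months)
    return {"train": (train_start, val_start), "validation": (val_start, end)}


# Values taken from the V1 entry: 36 training months, 6 validation months.
windows = split_windows(date(2025, 8, 28), training_months=36, validation_months=6)
for name, (start, end) in windows.items():
    print(name, start.isoformat(), end.isoformat())
```

Half-open ranges keep the two windows from sharing a boundary day, so the same rows never land in both training and validation.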

Pydantic Validation

Instead of relying on ad-hoc parsing, we used Pydantic models to validate YAML.

from pydantic import BaseModel

class Hyperparameters(BaseModel):
    n_estimators: int
    max_depth: int

class TrainingConfig(BaseModel):
    name: str
    country: str
    run_task: bool
    model: str
    model_type: str
    hyperparameters: Hyperparameters
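For comparison, here is a sketch of what a schema that mirrors the YAML entry shown earlier might look like, including the nested thresholds. The class names TrainingVariant and Threshold, the default values, and the choice to clear protected_namespaces (so a field called model_type does not trigger a warning on some Pydantic v2 releases) are all assumptions for illustration, not the schema actually used in this pipeline. It validates a plain dict so it runs without a YAML parser.

```python
from typing import Dict

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class Hyperparameters(BaseModel):
    n_estimators: int
    max_depth: int


class Threshold(BaseModel):
    threshold: float
    greater_is_better: bool = True


class TrainingVariant(BaseModel):
    # Allow field names that start with "model_" without a namespace warning.
    model_config = ConfigDict(protected_namespaces=())

    train_code: str
    save_data: bool = False
    run_task: bool = True
    target: str
    date_column: str
    validation_months: int
    training_months: int
    model: str
    model_type: str
    automated_deployment: bool = False
    hyperparameters: Hyperparameters
    thresholds: Dict[str, Threshold] = Field(default_factory=dict)


# Same shape as the V1 entry once the YAML has been parsed into a dict.
raw_variant = {
    "train_code": "V1",
    "save_data": True,
    "run_task": True,
    "target": "target_cil",
    "date_column": "date_col",
    "validation_months": 6,
    "training_months": 36,
    "model": "random forest",
    "model_type": "sklearn",
    "automated_deployment": True,
    "hyperparameters": {"n_estimators": 100, "max_depth": 10},
    "thresholds": {
        "precision_score": {"threshold": 0.70, "greater_is_better": True},
        "recall_score": {"threshold": 0.60, "greater_is_better": True},
    },
}
variant = TrainingVariant.model_validate(raw_variant)

# A typo in the config fails at load time instead of in the middle of training.
bad_variant = dict(raw_variant, hyperparameters={"n_estimators": "many", "max_depth": 10})
try:
    TrainingVariant.model_validate(bad_variant)
except ValidationError as exc:
    print(f"rejected config with {exc.error_count()} error(s)")
```

The point of the nested models is that a bad hyperparameter or a threshold with the wrong type is caught before any cluster time is spent.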

Load Config Job

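# `config` is the validated Pydantic object built from the YAML file.
model_dicts = [
    model_cfg.model_dump()
    for model_cfg in config.training.model_variants
]

# `dbutils` is provided by the Databricks notebook runtime; the value must be JSON-serializable.
dbutils.jobs.taskValues.set(key="models", value=model_dicts)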

This step:

Dumps every model config as a plain dict, with no filtering yet.
Publishes the list under the task value key models so downstream tasks can read it.
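Since task values travel as JSON, it can help to check the payload before publishing it. The sketch below is one way to do that; check_task_value is a hypothetical helper, and the 48 KiB cap is the limit I understand the Databricks documentation to state for a task value's JSON form, so treat that number as an assumption to verify.

```python
import json
from typing import Any, Dict, List

# Assumed cap on the JSON representation of one task value; verify against the docs.
MAX_TASK_VALUE_BYTES = 48 * 1024


def check_task_value(variants: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Confirm the payload is JSON-serializable and under the size cap."""
    encoded = json.dumps(variants)  # raises TypeError for non-JSON types such as date
    size = len(encoded.encode("utf-8"))
    if size > MAX_TASK_VALUE_BYTES:
        raise ValueError(f"task value is {size} bytes, over the {MAX_TASK_VALUE_BYTES} byte cap")
    return variants


# Trimmed-down stand-ins for the dumped model configs.
model_dicts = [
    {"train_code": "V1", "run_task": True, "model": "random forest"},
    {"train_code": "V2", "run_task": False, "model": "random forest"},
]
payload = check_task_value(model_dicts)

# Inside a Databricks notebook this would be followed by:
# dbutils.jobs.taskValues.set(key="models", value=payload)
```

If the configs ever grow past the cap, one option is to publish only the train_code values and let each iteration load its own config from the YAML file.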

Parallel Execution with ForEach in Databricks

The game changer is the Databricks Workflows ForEach task, which spawns one parallel task run per model or dataset.

In our case, each configuration becomes one task execution — running preprocessing and training independently.

Here’s how it looks in a Databricks Asset Bundle (DAB) YAML:

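resources:
  jobs:
    training_job:
      name: Training_Job
      tasks:
        - task_key: LoadConfig
          notebook_task:
            notebook_path: ./jobs/load_config.py
        - task_key: Training
          depends_on:
            - task_key: LoadConfig
          for_each_task:
            # "models" is the task value key set by LoadConfig
            inputs: "{{tasks.LoadConfig.values.models}}"
            concurrency: 4  # illustrative cap on parallel iterations
            task:
              task_key: Training_iteration
              notebook_task:
                notebook_path: ./jobs/train.py
                base_parameters:
                  model_config: "{{input}}"  # parameter name is illustrative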

This enables:

Parallel execution of each model/data variant
Clean isolation of logs and artifacts per run
Scalability — just add a new config, no new code needed
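On the receiving side, each iteration has to turn its ForEach item back into a config. A minimal sketch, assuming the item reaches the training notebook as a JSON string in a task parameter; the parameter name model_config and the helper parse_iteration_input are hypothetical, and skipping disabled variants here is just one place the run_task flag could be honoured.

```python
import json
from typing import Any, Dict, Optional


def parse_iteration_input(raw: str) -> Optional[Dict[str, Any]]:
    """Decode one ForEach item; return None when the variant is switched off."""
    cfg = json.loads(raw)
    if not isinstance(cfg, dict):
        raise ValueError("expected a JSON object describing one model variant")
    if not cfg.get("run_task", True):
        return None
    return cfg


# In a Databricks notebook the raw string would come from the task parameter, e.g.
# raw = dbutils.widgets.get("model_config")  # hypothetical parameter name
raw = '{"train_code": "V1", "run_task": true, "model": "random forest"}'
cfg = parse_iteration_input(raw)

if cfg is None:
    print("variant disabled, nothing to do")
else:
    print(f"training variant {cfg['train_code']} with model {cfg['model']}")
```

Returning None for a disabled variant lets the notebook exit early and cleanly, so a switched-off config shows up as a quick successful iteration rather than a failure.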

Databricks Asset Bundles (DAB)

With DAB, this entire workflow is versioned and deployed as code.

DAB lets us override environment-specific parameters without editing the core YAML or jobs.

📊 Before vs After: How Our Pipeline Evolved

🎁 Benefits

Onboarding in minutes → adding a new model or country is just a new YAML entry. No code changes.
True parallelism → each config runs independently thanks to ForEach.
Strong validation → configs are enforced upfront with Pydantic.
Hands-free deployment → DAB takes care of multi-environment rollouts.
Full reproducibility → MLflow + Unity Catalog track everything for lineage and governance.

✨ Thanks for reading, and I hope this gave you ideas for making your ML pipelines simpler, faster, and easier to scale!