As companies double down on machine learning (ML), one thing is obvious: a single model can't solve every problem. Different datasets, different timelines, and different requirements make managing multiple models pretty tricky. And if you've ever worked with traditional pipelines, you know the pain: they're rigid, messy to maintain, and usually need code changes every time business logic shifts.
To make life easier, we built a config-driven, parallel setup using:
- Pydantic to keep our configs clean and validated.
- Databricks ForEach tasks to run things in parallel and save a ton of runtime.
- Databricks Asset Bundles (DAB) so deployments are smooth and automated.
- Databricks Task Values to pass configs around without extra hacks.
The best part? It's now super easy to scale. Onboarding a new model or dataset is as simple as dropping in a config, with no need to touch the codebase.
The Challenge
We needed to:
- Train models for different datasets with their own feature/label configurations.
- Ensure preprocessing and training stayed decoupled but linked.
- Scale without duplicating code.
- Pass configurations cleanly between Databricks tasks.
In short, what we were really after was:
- Config-driven training (no hardcoding every little thing).
- Parallel execution to actually bring runtimes down.
- Reusable preprocessing so each dataset could run with its own filters and date ranges.
- Automated deployment that kicks in only if validation metrics look good.
Basic Workflow
Solution
Configuration-Driven Setup
We maintain a centralized YAML configuration that defines each data-specific model and its runs.
training:  # config for training
  training_variants:
    # Variant 1
    - train_code: V1
      save_data: true
      run_task: true
      target: target_cil
      date_column: date_col
      validation_months: 6
      training_months: 36
      model: random forest
      model_type: sklearn
      automated_deployment: true
      hyperparameters:
        n_estimators: 100
        max_depth: 10
      thresholds:
        precision_score:
          threshold: 0.70
          greater_is_better: true
        recall_score:
          threshold: 0.60
          greater_is_better: true
    # Variant 2
    - train_code: V2
      # ...and so on
Each model or dataset basically carries its own config, which defines things like:
- Features/labels
- Training/validation windows
- Model type + hyperparameters
- Deployment settings
- Performance thresholds
Because of this setup, the whole system feels:
- Declarative: no hidden, hardcoded logic buried in code.
- Extendable: want a new variant? Just drop in a new config entry.
- Safe: you can toggle runs easily with the run_task flag.
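To make the thresholds block concrete, here is a minimal sketch of how an automated-deployment gate could consume it. The passes_thresholds helper and the example metric values are illustrative, not lifted from our codebase:

def passes_thresholds(metrics: dict, thresholds: dict) -> bool:
    """Return True only if every configured metric clears its threshold."""
    for name, rule in thresholds.items():
        value = metrics[name]
        if rule["greater_is_better"]:
            if value < rule["threshold"]:
                return False
        elif value > rule["threshold"]:
            return False
    return True

# Using the V1 thresholds above: recall misses 0.60, so deployment is skipped.
metrics = {"precision_score": 0.74, "recall_score": 0.58}
thresholds = {
    "precision_score": {"threshold": 0.70, "greater_is_better": True},
    "recall_score": {"threshold": 0.60, "greater_is_better": True},
}
print(passes_thresholds(metrics, thresholds))  # False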
Pydantic Validation
Instead of relying on ad-hoc parsing, we used Pydantic models to validate YAML.
from pydantic import BaseModel


class Hyperparameters(BaseModel):
    n_estimators: int
    max_depth: int


class TrainingConfig(BaseModel):
    name: str
    country: str
    run_task: bool
    model: str
    model_type: str
    hyperparameters: Hyperparameters
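To show where these models slot in, here is a minimal sketch of loading and validating the YAML above (Pydantic v2 API; the config.yml path is an assumption, and it presumes the TrainingConfig fields line up with the variant entries):

import yaml
from pydantic import ValidationError

with open("config.yml") as f:
    raw = yaml.safe_load(f)

try:
    variants = [
        TrainingConfig.model_validate(entry)
        for entry in raw["training"]["training_variants"]
    ]
except ValidationError as err:
    # Fail fast: a bad or missing field stops the job before any training starts.
    raise SystemExit(f"Invalid training config: {err}")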
Load Config Job
# Publish every validated variant so downstream tasks can fan out over them.
model_dicts = [
    model_cfg.model_dump()
    for model_cfg in config.training.model_variants
]
dbutils.jobs.taskValues.set(key="models", value=model_dicts)
A couple of notes:
- The task simply dumps all model configs as plain dicts; there is no filtering at this step.
- Naming is kept generic (model_cfg / model_variants) so the same loader works for any variant.
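On the consuming side, any downstream task can read these values back with dbutils.jobs.taskValues.get (in our workflow the ForEach task references them via a dynamic value instead, as shown in the next section). A sketch, assuming the producing task is named LoadConfig:

# Read the variant configs published by the LoadConfig task.
models = dbutils.jobs.taskValues.get(
    taskKey="LoadConfig",
    key="models",
    debugValue=[],  # only used when running the notebook interactively
)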
Parallel Execution with ForEach in Databricks
The game changer is the Databricks Workflows ForEach task, which spawns parallel tasks per model or dataset.
In our case, each configuration becomes one task execution โ running preprocessing and training independently.
Here's how it looks in a Databricks Asset Bundle (DAB) YAML:
resources:
  jobs:
    training_job:
      name: Training_Job
      tasks:
        - task_key: LoadConfig
          notebook_task:
            notebook_path: ./jobs/load_config.py
        - task_key: Training
          depends_on:
            - task_key: LoadConfig
          for_each_task:
            inputs: "{{tasks.LoadConfig.values.models}}"
            task:
              task_key: Training_iteration
              notebook_task:
                notebook_path: ./jobs/train.py
                base_parameters:
                  variant: "{{input}}"
This enables:
- Parallel execution of each model/data variant
- Clean isolation of logs and artifacts per run
- Scalability: just add a new config, no new code needed
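Inside the iterated notebook, each run sees only its own variant. Here is a sketch of how ./jobs/train.py might pick it up, assuming the variant arrives as a JSON string via the base parameter shown in the YAML above (train_model is a hypothetical entry point, not our actual function):

import json

# Each ForEach iteration receives exactly one variant via the "variant" parameter.
dbutils.widgets.text("variant", "{}")
variant = json.loads(dbutils.widgets.get("variant"))

if variant.get("run_task", False):
    train_model(variant)  # hypothetical: preprocessing + training for this variant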
Databricks Asset Bundles (DAB)
With DAB, this entire workflow is versioned and deployed as code.
DAB lets us override environment-specific parameters without editing the core YAML or jobs.
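As an illustration, environment-specific overrides in DAB live in the bundle's targets. The bundle name, target names, and hosts below are placeholders, not our actual setup:

# databricks.yml (sketch)
bundle:
  name: ml_training_pipeline

include:
  - resources/*.yml  # the training_job definition shown earlier

targets:
  dev:
    mode: development
    workspace:
      host: https://dev-workspace.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com

Deploying to an environment is then a single command, for example: databricks bundle deploy -t prod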
Before vs After: How Our Pipeline Evolved
Benefits
- Onboarding in minutes: adding a new model or country is just a new YAML entry. No code changes.
- True parallelism: each config runs independently thanks to ForEach.
- Strong validation: configs are enforced upfront with Pydantic.
- Hands-free deployment: DAB takes care of multi-environment rollouts.
- Full reproducibility: MLflow + Unity Catalog track everything for lineage and governance.
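For completeness, a minimal sketch of logging a run and registering its model in Unity Catalog via MLflow. The dummy data and the three-level model name (main.ml_models.v1_random_forest) are placeholders:

import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

mlflow.set_registry_uri("databricks-uc")  # register models in Unity Catalog

X, y = make_classification(n_samples=200, random_state=0)  # stand-in for real training data
model = RandomForestClassifier(n_estimators=100, max_depth=10).fit(X, y)

with mlflow.start_run(run_name="V1"):
    mlflow.log_params({"n_estimators": 100, "max_depth": 10})
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="main.ml_models.v1_random_forest",
    )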
Thanks for reading, and I hope this gave you ideas for making your ML pipelines simpler, faster, and easier to scale!