Parallel Model Training & Data Pipelines on Databricks (ForEach Tasks + Asset Bundles + Pydantic)

sandy311

As companies double down on machine learning (ML), one thing is obvious: a single model can't solve every problem. Different datasets, different timelines, and different requirements make managing multiple models pretty tricky. And if you've ever worked with traditional pipelines, you know the pain — they're rigid, messy to maintain, and usually need code changes every time business logic shifts.

To make life easier, we built a config-driven, parallel setup using:

  • Pydantic to keep our configs clean and validated.
  • Databricks ForEach tasks to run things in parallel and save a ton of runtime.
  • Databricks Asset Bundles (DAB) so deployments are smooth and automated.
  • Databricks Task Values to pass configs around without extra hacks.

The best part? It's now super easy to scale. Onboarding a new model or dataset is as simple as dropping in a config, with no need to touch the codebase.

🚨 The Challenge

We needed to:

  • Train models for different datasets with their own feature/label configurations.
  • Ensure preprocessing and training stayed decoupled but linked.
  • Scale without duplicating code.
  • Pass configurations cleanly between Databricks tasks.

In short, what we were really after was:

  • Config-driven training (no hardcoding every little thing).
  • Parallel execution to actually bring runtimes down.
  • Reusable preprocessing so each dataset could run with its own filters and date ranges.
  • Automated deployment that kicks in only if validation metrics look good.

Basic Workflow

[Workflow diagram]

Solution

Configuration-Driven Setup

We maintain a centralized YAML configuration that defines dataset-specific models and runs.

training:  # config for training
  training_variants:
    # Variant 1
    - train_code: V1
      save_data: true
      run_task: true
      target: target_cil
      date_column: date_col
      validation_months: 6
      training_months: 36
      model: random forest
      model_type: sklearn
      automated_deployment: true
      hyperparameters:
        n_estimators: 100
        max_depth: 10
      thresholds:
        precision_score:
          threshold: 0.70
          greater_is_better: true
        recall_score:
          threshold: 0.60
          greater_is_better: true

    # Variant 2
    - train_code: V2
      # ...and so on

Each model or dataset basically carries its own config, which defines things like:

  • Features/labels
  • Training/validation windows
  • Model type + hyperparameters
  • Deployment settings
  • Performance thresholds

Because of this setup, the whole system feels:

  • Declarative → no hidden, hardcoded logic buried in code.
  • Extendable → want a new variant? Just drop in a new config entry.
  • Safe → you can toggle runs easily with the run_task flag.
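For instance, the thresholds block in the YAML above can gate automated deployment. Here is a minimal sketch of that gating logic (the passes_thresholds helper and the example metric values are our own illustration, not code from the pipeline):

def passes_thresholds(metrics: dict, thresholds: dict) -> bool:
    # Every configured metric must clear its threshold in the right direction.
    for name, spec in thresholds.items():
        value = metrics[name]
        if spec["greater_is_better"]:
            if value < spec["threshold"]:
                return False
        elif value > spec["threshold"]:
            return False
    return True

# Example: metrics from a validation run checked against the V1 thresholds above
metrics = {"precision_score": 0.74, "recall_score": 0.65}
thresholds = {
    "precision_score": {"threshold": 0.70, "greater_is_better": True},
    "recall_score": {"threshold": 0.60, "greater_is_better": True},
}
assert passes_thresholds(metrics, thresholds)  # deployment may proceed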

Pydantic Validation

Instead of relying on ad-hoc parsing, we used Pydantic models to validate YAML.

from pydantic import BaseModel

class Hyperparameters(BaseModel):
    n_estimators: int
    max_depth: int

class TrainingConfig(BaseModel):
    name: str
    country: str
    run_task: bool
    model: str
    model_type: str
    hyperparameters: Hyperparameters
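To tie the YAML and the models together, the config can be parsed and validated once, up front. A minimal sketch: the wrapper classes and the config path are our additions, and TrainingConfig is the class defined above:

from typing import List
import yaml
from pydantic import BaseModel

class Training(BaseModel):
    training_variants: List[TrainingConfig]  # TrainingConfig from above

class Config(BaseModel):
    training: Training

with open("conf/training.yml") as f:  # hypothetical path
    config = Config.model_validate(yaml.safe_load(f))
# A typo or missing field now fails fast with a pydantic.ValidationError.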

Load Config Job

# load_config.py: publish every variant's config as a task value
# so the downstream ForEach task can fan out over it.
model_dicts = [
    model_cfg.model_dump()
    for model_cfg in config.training.model_variants
]

dbutils.jobs.taskValues.set(key="models", value=model_dicts)

This way:

  • The load job simply dumps all model configs (no filtering at this step).
  • Naming is kept generic: model_cfg / model_variants.
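On the consuming side, a downstream task can read the same list back with the matching get call. A minimal sketch (the default and debugValue arguments are only there so the notebook also runs outside a job):

models = dbutils.jobs.taskValues.get(
    taskKey="LoadConfig",  # the task that set the value
    key="models",
    default=[],            # fallback if the key was never set
    debugValue=[],         # used when running the notebook interactively
)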

Parallel Execution with ForEach in Databricks

The game changer is the Databricks Workflows ForEach task, which spawns parallel tasks per model or dataset.

In our case, each configuration becomes one task execution — running preprocessing and training independently.

Here's how it looks in a Databricks Asset Bundle (DAB) YAML:

resources:
  jobs:
    training_job:
      name: Training_Job
      tasks:
        - task_key: LoadConfig
          notebook_task:
            notebook_path: ./jobs/load_config.py
        - task_key: Training
          depends_on:
            - task_key: LoadConfig
          for_each_task:
            inputs: "{{tasks.LoadConfig.values.models}}"
            task:
              task_key: Training_iteration
              notebook_task:
                notebook_path: ./jobs/train.py
                base_parameters:
                  model_config: "{{input}}"

This enables:

  • Parallel execution of each model/data variant
  • Clean isolation of logs and artifacts per run
  • Scalability — just add a new config, no new code needed
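Inside ./jobs/train.py, each ForEach iteration then picks up exactly one variant. A minimal sketch, assuming the model_config parameter name from the base_parameters mapping above (our addition) and the TrainingConfig model defined earlier:

import json

# Each iteration receives its variant as a JSON string via {{input}}.
raw = dbutils.widgets.get("model_config")
cfg = TrainingConfig(**json.loads(raw))  # re-validate before training

if cfg.run_task:
    print(f"Training {cfg.model} ({cfg.model_type}) with {cfg.hyperparameters}")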

Databricks Asset Bundles (DAB)

With DAB, this entire workflow is versioned and deployed as code.

DAB lets us override environment-specific parameters without editing the core YAML or jobs.
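For example, a targets section in databricks.yml can swap hosts and settings per environment. An illustrative snippet (the workspace URLs are placeholders):

targets:
  dev:
    mode: development
    workspace:
      host: https://dev-workspace.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com

Deploying then becomes databricks bundle deploy -t dev (or -t prod), with the same job definition underneath.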

📊 Before vs After: How Our Pipeline Evolved

[Before vs. after pipeline comparison diagram]

๐ŸŽ Benefits

  • Onboarding in minutes → adding a new model or country is just a new YAML entry. No code changes.
  • True parallelism → each config runs independently thanks to ForEach.
  • Strong validation → configs are enforced upfront with Pydantic.
  • Hands-free deployment → DAB takes care of multi-environment rollouts.
  • Full reproducibility → MLflow + Unity Catalog track everything for lineage and governance.
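As a concrete illustration of the MLflow + Unity Catalog piece, here is a hedged sketch of logging and registering one variant (the toy dataset and the three-level UC model name are made up, not from the pipeline):

import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

mlflow.set_registry_uri("databricks-uc")  # register models into Unity Catalog

X, y = make_classification(n_samples=200, random_state=0)  # toy stand-in data
model = RandomForestClassifier(n_estimators=100, max_depth=10).fit(X, y)

with mlflow.start_run(run_name="train_V1"):
    mlflow.log_params({"n_estimators": 100, "max_depth": 10})
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        registered_model_name="main.ml_models.variant_v1",  # made-up UC name
    )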


✨ Thanks for reading, and I hope this gave you ideas for making your ML pipelines simpler, faster, and easier to scale!
