<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Parallel Model Training &amp; Data Pipelines on Databricks (ForEach Tasks + Asset Bundles + Pydantic) in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/parallel-model-training-amp-data-pipelines-on-databricks-foreach/m-p/130055#M623</link>
    <description>&lt;P class=""&gt;As companies double down on machine learning (ML), one thing is obvious:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;a single model can’t solve every problem&lt;/STRONG&gt;. Different datasets, different timelines, and different requirements make managing multiple models pretty tricky. And if you’ve ever worked with traditional pipelines, you know the pain — they’re rigid, messy to maintain, and usually need code changes every time business logic shifts.&lt;/P&gt;&lt;P class=""&gt;To make life easier, we built a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;config-driven, parallel setup&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;using:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Pydantic&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;to keep our configs clean and validated.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Databricks ForEach tasks&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;to run things in parallel and save a ton of runtime.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Databricks Asset Bundles (DAB)&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;so deployments are smooth and automated&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Databricks Task Values&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;to pass configs around without extra hacks&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;The best part? It’s now super easy to scale. 
Onboarding a new model or dataset is as simple as dropping in a config, with no changes to the codebase.&lt;/P&gt;&lt;H2 id="76eb"&gt;&lt;span class="lia-unicode-emoji" title=":police_car_light:"&gt;🚨&lt;/span&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;The Challenge&lt;/STRONG&gt;&lt;/H2&gt;&lt;P class=""&gt;We needed to:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;Train models for different datasets with their own feature/label configurations.&lt;/LI&gt;&lt;LI&gt;Ensure preprocessing and training stayed&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;decoupled but linked&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;Scale without duplicating code.&lt;/LI&gt;&lt;LI&gt;Pass configurations cleanly between Databricks tasks.&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;In short, what we were really after was:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Config-driven training&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(no hardcoding every little thing).&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Parallel execution&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;to actually bring runtimes down.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Reusable preprocessing&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;so each dataset could run with its own filters and date ranges.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Automated deployment&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;that runs only when validation metrics meet the configured thresholds.&lt;/LI&gt;&lt;/UL&gt;&lt;H2 id="50f6"&gt;&lt;STRONG&gt;Basic Workflow&lt;/STRONG&gt;&lt;/H2&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="sandy311_0-1756395092044.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19450i954C5D56D9ED92CE/image-size/medium?v=v2&amp;amp;px=400" role="button" title="sandy311_0-1756395092044.png" alt="sandy311_0-1756395092044.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;H2 id="dbe9"&gt;Solution&lt;/H2&gt;&lt;H2 
id="a819"&gt;Configuration-Driven Setup&lt;/H2&gt;&lt;P class=""&gt;We maintain a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;centralized YAML configuration&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;for dataset-specific models and runs.&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;training:&lt;/SPAN&gt; &lt;SPAN class=""&gt;"config for training:"&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;training_variants:&lt;/SPAN&gt;&lt;BR /&gt;  &lt;SPAN class=""&gt;# Variant 1&lt;/SPAN&gt;&lt;BR /&gt;  &lt;SPAN class=""&gt;-&lt;/SPAN&gt; &lt;SPAN class=""&gt;train_code:&lt;/SPAN&gt; &lt;SPAN class=""&gt;V1&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;save_data:&lt;/SPAN&gt; &lt;SPAN class=""&gt;true&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;run_task:&lt;/SPAN&gt; &lt;SPAN class=""&gt;true&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;target:&lt;/SPAN&gt; &lt;SPAN class=""&gt;target_cil&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;date_column:&lt;/SPAN&gt; &lt;SPAN class=""&gt;date_col&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;validation_months:&lt;/SPAN&gt; &lt;SPAN class=""&gt;6&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;training_months:&lt;/SPAN&gt; &lt;SPAN class=""&gt;36&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;model:&lt;/SPAN&gt; &lt;SPAN class=""&gt;random&lt;/SPAN&gt; &lt;SPAN class=""&gt;forest&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;model_type:&lt;/SPAN&gt; &lt;SPAN class=""&gt;sklearn&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;automated_deployment:&lt;/SPAN&gt; &lt;SPAN class=""&gt;true&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;hyperparameters:&lt;/SPAN&gt;&lt;BR /&gt;      &lt;SPAN class=""&gt;n_estimators:&lt;/SPAN&gt; &lt;SPAN class=""&gt;100&lt;/SPAN&gt;&lt;BR /&gt;      &lt;SPAN class=""&gt;max_depth:&lt;/SPAN&gt; &lt;SPAN class=""&gt;10&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;thresholds:&lt;/SPAN&gt;&lt;BR /&gt;      &lt;SPAN 
class=""&gt;precision_score:&lt;/SPAN&gt;&lt;BR /&gt;        &lt;SPAN class=""&gt;threshold:&lt;/SPAN&gt; &lt;SPAN class=""&gt;0.70&lt;/SPAN&gt;&lt;BR /&gt;        &lt;SPAN class=""&gt;greater_is_better:&lt;/SPAN&gt; &lt;SPAN class=""&gt;true&lt;/SPAN&gt;&lt;BR /&gt;      &lt;SPAN class=""&gt;recall_score:&lt;/SPAN&gt;&lt;BR /&gt;        &lt;SPAN class=""&gt;threshold:&lt;/SPAN&gt; &lt;SPAN class=""&gt;0.60&lt;/SPAN&gt;&lt;BR /&gt;        &lt;SPAN class=""&gt;greater_is_better:&lt;/SPAN&gt; &lt;SPAN class=""&gt;true&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;  &lt;SPAN class=""&gt;-&lt;/SPAN&gt; &lt;SPAN class=""&gt;train_code:&lt;/SPAN&gt; &lt;SPAN class=""&gt;V2&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;...so&lt;/SPAN&gt; &lt;SPAN class=""&gt;on&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;Each model or dataset basically carries its own config, which defines things like:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;Features/labels&lt;/LI&gt;&lt;LI&gt;Training/validation windows&lt;/LI&gt;&lt;LI&gt;Model type + hyperparameters&lt;/LI&gt;&lt;LI&gt;Deployment settings&lt;/LI&gt;&lt;LI&gt;Performance thresholds&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;Because of this setup, the whole system feels:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Declarative&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;→ no hidden, hardcoded logic buried in code&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Extendable&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;→ want a new variant? 
just drop in a new config entry&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Safe&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;→ you can toggle runs easily with a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;run_task&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;flag&lt;/LI&gt;&lt;/UL&gt;&lt;H2 id="bfab"&gt;&lt;STRONG&gt;Pydantic Validation&lt;/STRONG&gt;&lt;/H2&gt;&lt;P class=""&gt;Instead of relying on ad-hoc parsing, we used&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Pydantic models&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;to validate YAML.&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;from&lt;/SPAN&gt; pydantic &lt;SPAN class=""&gt;import&lt;/SPAN&gt; BaseModel&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;class&lt;/SPAN&gt; &lt;SPAN class=""&gt;Hyperparameters&lt;/SPAN&gt;(&lt;SPAN class=""&gt;BaseModel&lt;/SPAN&gt;):&lt;BR /&gt;    n_estimators: &lt;SPAN class=""&gt;int&lt;/SPAN&gt;&lt;BR /&gt;    max_depth: &lt;SPAN class=""&gt;int&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;class&lt;/SPAN&gt; &lt;SPAN class=""&gt;TrainingConfig&lt;/SPAN&gt;(&lt;SPAN class=""&gt;BaseModel&lt;/SPAN&gt;):&lt;BR /&gt;    name: &lt;SPAN class=""&gt;str&lt;/SPAN&gt;&lt;BR /&gt;    country: &lt;SPAN class=""&gt;str&lt;/SPAN&gt;&lt;BR /&gt;    run_task: &lt;SPAN class=""&gt;bool&lt;/SPAN&gt;&lt;BR /&gt;    model: &lt;SPAN class=""&gt;str&lt;/SPAN&gt;&lt;BR /&gt;    model_type: &lt;SPAN class=""&gt;str&lt;/SPAN&gt;&lt;BR /&gt;    hyperparameters: Hyperparameters&lt;/SPAN&gt;&lt;/PRE&gt;&lt;H2 id="c1ef"&gt;Load Config Job&lt;/H2&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;model_dicts = [&lt;BR /&gt;    model_cfg.model_dump()&lt;BR /&gt;    &lt;SPAN class=""&gt;for&lt;/SPAN&gt; model_cfg &lt;SPAN class=""&gt;in&lt;/SPAN&gt; config.training.model_variants&lt;BR /&gt;]&lt;BR /&gt;&lt;BR /&gt;dbutils.jobs.taskValues.&lt;SPAN 
class=""&gt;set&lt;/SPAN&gt;(key=&lt;SPAN class=""&gt;"models"&lt;/SPAN&gt;, value=model_dicts)&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;This way:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;The task publishes&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;all model configs&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;as one task value (no filtering at this step).&lt;/LI&gt;&lt;LI&gt;Naming stays generic:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;model_cfg&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;/&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;model_variants.&lt;/LI&gt;&lt;/UL&gt;&lt;H2 id="4b9a"&gt;Parallel Execution with ForEach in Databricks&lt;/H2&gt;&lt;P class=""&gt;The game changer is the Databricks Workflows ForEach task, which runs one task iteration per model or dataset, in parallel up to the configured concurrency.&lt;/P&gt;&lt;P class=""&gt;In our case, each&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;configuration becomes one task execution&lt;/STRONG&gt;, running preprocessing and training independently.&lt;/P&gt;&lt;P class=""&gt;Here’s how it looks in a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Databricks Asset Bundle (DAB)&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;YAML:&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;resources:&lt;/SPAN&gt;&lt;BR /&gt;  &lt;SPAN class=""&gt;jobs:&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;training_job:&lt;/SPAN&gt;&lt;BR /&gt;      &lt;SPAN class=""&gt;name:&lt;/SPAN&gt; &lt;SPAN class=""&gt;Training_Job&lt;/SPAN&gt;&lt;BR /&gt;      &lt;SPAN class=""&gt;tasks:&lt;/SPAN&gt;&lt;BR /&gt;        &lt;SPAN class=""&gt;-&lt;/SPAN&gt; &lt;SPAN class=""&gt;task_key:&lt;/SPAN&gt; &lt;SPAN class=""&gt;LoadConfig&lt;/SPAN&gt;&lt;BR /&gt;          &lt;SPAN class=""&gt;notebook_task:&lt;/SPAN&gt;&lt;BR /&gt;            &lt;SPAN class=""&gt;notebook_path:&lt;/SPAN&gt; &lt;SPAN class=""&gt;./jobs/load_config.py&lt;/SPAN&gt;&lt;BR /&gt;        &lt;SPAN class=""&gt;-&lt;/SPAN&gt; &lt;SPAN class=""&gt;task_key:&lt;/SPAN&gt; &lt;SPAN class=""&gt;Training&lt;/SPAN&gt;&lt;BR /&gt;          &lt;SPAN class=""&gt;depends_on:&lt;/SPAN&gt;&lt;BR /&gt;            &lt;SPAN class=""&gt;-&lt;/SPAN&gt; &lt;SPAN class=""&gt;task_key:&lt;/SPAN&gt; &lt;SPAN class=""&gt;LoadConfig&lt;/SPAN&gt;&lt;BR /&gt;          &lt;SPAN class=""&gt;for_each_task:&lt;/SPAN&gt;&lt;BR 
/&gt;            &lt;SPAN class=""&gt;# concurrency, the nested task_key, and the model_config parameter name are illustrative&lt;/SPAN&gt;&lt;BR /&gt;            &lt;SPAN class=""&gt;inputs:&lt;/SPAN&gt; &lt;SPAN class=""&gt;"&lt;SPAN class=""&gt;{{tasks.LoadConfig.values.models}}&lt;/SPAN&gt;"&lt;/SPAN&gt;&lt;BR /&gt;            &lt;SPAN class=""&gt;concurrency:&lt;/SPAN&gt; &lt;SPAN class=""&gt;4&lt;/SPAN&gt;&lt;BR /&gt;            &lt;SPAN class=""&gt;task:&lt;/SPAN&gt;&lt;BR /&gt;              &lt;SPAN class=""&gt;task_key:&lt;/SPAN&gt; &lt;SPAN class=""&gt;TrainVariant&lt;/SPAN&gt;&lt;BR /&gt;              &lt;SPAN class=""&gt;notebook_task:&lt;/SPAN&gt;&lt;BR /&gt;                &lt;SPAN class=""&gt;notebook_path:&lt;/SPAN&gt; &lt;SPAN class=""&gt;./jobs/train.py&lt;/SPAN&gt;&lt;BR /&gt;                &lt;SPAN class=""&gt;base_parameters:&lt;/SPAN&gt;&lt;BR /&gt;                  &lt;SPAN class=""&gt;model_config:&lt;/SPAN&gt; &lt;SPAN class=""&gt;"&lt;SPAN class=""&gt;{{input}}&lt;/SPAN&gt;"&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;This enables:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Parallel execution&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;of each model/dataset variant&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Clean isolation&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;of logs and artifacts per run&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Scalability&lt;/STRONG&gt;: just add a new config, no new code needed&lt;/LI&gt;&lt;/UL&gt;&lt;H2 id="dac9"&gt;Databricks Asset Bundles (DAB)&lt;/H2&gt;&lt;P class=""&gt;With&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;DAB&lt;/STRONG&gt;, this entire workflow is versioned and deployed as code.&lt;/P&gt;&lt;P class=""&gt;DAB targets let us override&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;environment-specific parameters&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;without editing the core YAML or jobs.&lt;/P&gt;&lt;H2 id="54be"&gt;&lt;span class="lia-unicode-emoji" title=":bar_chart:"&gt;📊&lt;/span&gt; Before vs After: How Our Pipeline Evolved&lt;/H2&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="sandy311_1-1756395091510.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19451iDD86A7688BCD11F5/image-size/medium?v=v2&amp;amp;px=400" role="button" title="sandy311_1-1756395091510.png" alt="sandy311_1-1756395091510.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;H2 id="46c8"&gt;&lt;span class="lia-unicode-emoji" title=":wrapped_gift:"&gt;🎁&lt;/span&gt; Benefits&lt;/H2&gt;&lt;UL 
class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Onboarding in minutes&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;→ adding a new model or country is just a new YAML entry. No code changes.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;True parallelism&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;→ each config runs independently thanks to ForEach.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Strong validation&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;→ configs are enforced upfront with Pydantic.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Hands-free deployment&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;→ DAB takes care of multi-environment rollouts.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Full reproducibility&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;→ MLflow + Unity Catalog track everything for lineage and governance&lt;/LI&gt;&lt;/UL&gt;&lt;H2 id="cc20"&gt;&lt;span class="lia-unicode-emoji" title=":link:"&gt;🔗&lt;/span&gt; References&lt;/H2&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;A class="" href="https://docs.databricks.com/en/workflows/jobs/jobs.html" target="_blank" rel="noopener ugc nofollow"&gt;Databricks Workflows: Task orchestration and ForEach&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A class="" href="https://docs.databricks.com/en/dev-tools/bundles/index.html" target="_blank" rel="noopener ugc nofollow"&gt;Databricks Asset Bundles (DAB)&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A class="" href="https://docs.pydantic.dev/" target="_blank" rel="noopener ugc nofollow"&gt;Pydantic Documentation&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;span class="lia-unicode-emoji" title=":sparkles:"&gt;✨&lt;/span&gt; Thanks for reading, and I hope this gave you ideas for making your ML pipelines simpler, faster, and easier to scale!&lt;/P&gt;</description>
    <pubDate>Thu, 28 Aug 2025 15:32:30 GMT</pubDate>
    <dc:creator>sandy311</dc:creator>
    <dc:date>2025-08-28T15:32:30Z</dc:date>
    <item>
      <title>Parallel Model Training &amp; Data Pipelines on Databricks (ForEach Tasks + Asset Bundles + Pydantic)</title>
      <link>https://community.databricks.com/t5/community-articles/parallel-model-training-amp-data-pipelines-on-databricks-foreach/m-p/130055#M623</link>
      <description>&lt;P class=""&gt;As companies double down on machine learning (ML), one thing is obvious:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;a single model can’t solve every problem&lt;/STRONG&gt;. Different datasets, different timelines, and different requirements make managing multiple models pretty tricky. And if you’ve ever worked with traditional pipelines, you know the pain — they’re rigid, messy to maintain, and usually need code changes every time business logic shifts.&lt;/P&gt;&lt;P class=""&gt;To make life easier, we built a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;config-driven, parallel setup&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;using:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Pydantic&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;to keep our configs clean and validated.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Databricks ForEach tasks&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;to run things in parallel and save a ton of runtime.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Databricks Asset Bundles (DAB)&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;so deployments are smooth and automated&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Databricks Task Values&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;to pass configs around without extra hacks&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;The best part? It’s now super easy to scale. 
Onboarding a new model or dataset is as simple as dropping in a config, with no changes to the codebase.&lt;/P&gt;&lt;H2 id="76eb"&gt;&lt;span class="lia-unicode-emoji" title=":police_car_light:"&gt;🚨&lt;/span&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;The Challenge&lt;/STRONG&gt;&lt;/H2&gt;&lt;P class=""&gt;We needed to:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;Train models for different datasets with their own feature/label configurations.&lt;/LI&gt;&lt;LI&gt;Ensure preprocessing and training stayed&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;decoupled but linked&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;Scale without duplicating code.&lt;/LI&gt;&lt;LI&gt;Pass configurations cleanly between Databricks tasks.&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;In short, what we were really after was:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Config-driven training&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(no hardcoding every little thing).&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Parallel execution&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;to actually bring runtimes down.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Reusable preprocessing&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;so each dataset could run with its own filters and date ranges.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Automated deployment&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;that runs only when validation metrics meet the configured thresholds.&lt;/LI&gt;&lt;/UL&gt;&lt;H2 id="50f6"&gt;&lt;STRONG&gt;Basic Workflow&lt;/STRONG&gt;&lt;/H2&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="sandy311_0-1756395092044.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19450i954C5D56D9ED92CE/image-size/medium?v=v2&amp;amp;px=400" role="button" title="sandy311_0-1756395092044.png" alt="sandy311_0-1756395092044.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;H2 id="dbe9"&gt;Solution&lt;/H2&gt;&lt;H2 
id="a819"&gt;Configuration-Driven Setup&lt;/H2&gt;&lt;P class=""&gt;We maintain a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;centralized YAML configuration&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;for dataset-specific models and runs.&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;training:&lt;/SPAN&gt; &lt;SPAN class=""&gt;"config for training:"&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;training_variants:&lt;/SPAN&gt;&lt;BR /&gt;  &lt;SPAN class=""&gt;# Variant 1&lt;/SPAN&gt;&lt;BR /&gt;  &lt;SPAN class=""&gt;-&lt;/SPAN&gt; &lt;SPAN class=""&gt;train_code:&lt;/SPAN&gt; &lt;SPAN class=""&gt;V1&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;save_data:&lt;/SPAN&gt; &lt;SPAN class=""&gt;true&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;run_task:&lt;/SPAN&gt; &lt;SPAN class=""&gt;true&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;target:&lt;/SPAN&gt; &lt;SPAN class=""&gt;target_cil&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;date_column:&lt;/SPAN&gt; &lt;SPAN class=""&gt;date_col&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;validation_months:&lt;/SPAN&gt; &lt;SPAN class=""&gt;6&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;training_months:&lt;/SPAN&gt; &lt;SPAN class=""&gt;36&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;model:&lt;/SPAN&gt; &lt;SPAN class=""&gt;random&lt;/SPAN&gt; &lt;SPAN class=""&gt;forest&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;model_type:&lt;/SPAN&gt; &lt;SPAN class=""&gt;sklearn&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;automated_deployment:&lt;/SPAN&gt; &lt;SPAN class=""&gt;true&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;hyperparameters:&lt;/SPAN&gt;&lt;BR /&gt;      &lt;SPAN class=""&gt;n_estimators:&lt;/SPAN&gt; &lt;SPAN class=""&gt;100&lt;/SPAN&gt;&lt;BR /&gt;      &lt;SPAN class=""&gt;max_depth:&lt;/SPAN&gt; &lt;SPAN class=""&gt;10&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;thresholds:&lt;/SPAN&gt;&lt;BR /&gt;      &lt;SPAN 
class=""&gt;precision_score:&lt;/SPAN&gt;&lt;BR /&gt;        &lt;SPAN class=""&gt;threshold:&lt;/SPAN&gt; &lt;SPAN class=""&gt;0.70&lt;/SPAN&gt;&lt;BR /&gt;        &lt;SPAN class=""&gt;greater_is_better:&lt;/SPAN&gt; &lt;SPAN class=""&gt;true&lt;/SPAN&gt;&lt;BR /&gt;      &lt;SPAN class=""&gt;recall_score:&lt;/SPAN&gt;&lt;BR /&gt;        &lt;SPAN class=""&gt;threshold:&lt;/SPAN&gt; &lt;SPAN class=""&gt;0.60&lt;/SPAN&gt;&lt;BR /&gt;        &lt;SPAN class=""&gt;greater_is_better:&lt;/SPAN&gt; &lt;SPAN class=""&gt;true&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;  &lt;SPAN class=""&gt;-&lt;/SPAN&gt; &lt;SPAN class=""&gt;train_code:&lt;/SPAN&gt; &lt;SPAN class=""&gt;V2&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;...so&lt;/SPAN&gt; &lt;SPAN class=""&gt;on&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;Each model or dataset basically carries its own config, which defines things like:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;Features/labels&lt;/LI&gt;&lt;LI&gt;Training/validation windows&lt;/LI&gt;&lt;LI&gt;Model type + hyperparameters&lt;/LI&gt;&lt;LI&gt;Deployment settings&lt;/LI&gt;&lt;LI&gt;Performance thresholds&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;Because of this setup, the whole system feels:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Declarative&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;→ no hidden, hardcoded logic buried in code&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Extendable&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;→ want a new variant? 
just drop in a new config entry&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Safe&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;→ you can toggle runs easily with a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;run_task&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;flag&lt;/LI&gt;&lt;/UL&gt;&lt;H2 id="bfab"&gt;&lt;STRONG&gt;Pydantic Validation&lt;/STRONG&gt;&lt;/H2&gt;&lt;P class=""&gt;Instead of relying on ad-hoc parsing, we used&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Pydantic models&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;to validate YAML.&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;from&lt;/SPAN&gt; pydantic &lt;SPAN class=""&gt;import&lt;/SPAN&gt; BaseModel&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;class&lt;/SPAN&gt; &lt;SPAN class=""&gt;Hyperparameters&lt;/SPAN&gt;(&lt;SPAN class=""&gt;BaseModel&lt;/SPAN&gt;):&lt;BR /&gt;    n_estimators: &lt;SPAN class=""&gt;int&lt;/SPAN&gt;&lt;BR /&gt;    max_depth: &lt;SPAN class=""&gt;int&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;class&lt;/SPAN&gt; &lt;SPAN class=""&gt;TrainingConfig&lt;/SPAN&gt;(&lt;SPAN class=""&gt;BaseModel&lt;/SPAN&gt;):&lt;BR /&gt;    name: &lt;SPAN class=""&gt;str&lt;/SPAN&gt;&lt;BR /&gt;    country: &lt;SPAN class=""&gt;str&lt;/SPAN&gt;&lt;BR /&gt;    run_task: &lt;SPAN class=""&gt;bool&lt;/SPAN&gt;&lt;BR /&gt;    model: &lt;SPAN class=""&gt;str&lt;/SPAN&gt;&lt;BR /&gt;    model_type: &lt;SPAN class=""&gt;str&lt;/SPAN&gt;&lt;BR /&gt;    hyperparameters: Hyperparameters&lt;/SPAN&gt;&lt;/PRE&gt;&lt;H2 id="c1ef"&gt;Load Config Job&lt;/H2&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;model_dicts = [&lt;BR /&gt;    model_cfg.model_dump()&lt;BR /&gt;    &lt;SPAN class=""&gt;for&lt;/SPAN&gt; model_cfg &lt;SPAN class=""&gt;in&lt;/SPAN&gt; config.training.model_variants&lt;BR /&gt;]&lt;BR /&gt;&lt;BR /&gt;dbutils.jobs.taskValues.&lt;SPAN 
class=""&gt;set&lt;/SPAN&gt;(key=&lt;SPAN class=""&gt;"models"&lt;/SPAN&gt;, value=model_dicts)&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;This way:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;The task publishes&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;all model configs&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;as one task value (no filtering at this step).&lt;/LI&gt;&lt;LI&gt;Naming stays generic:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;model_cfg&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;/&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;model_variants.&lt;/LI&gt;&lt;/UL&gt;&lt;H2 id="4b9a"&gt;Parallel Execution with ForEach in Databricks&lt;/H2&gt;&lt;P class=""&gt;The game changer is the Databricks Workflows ForEach task, which runs one task iteration per model or dataset, in parallel up to the configured concurrency.&lt;/P&gt;&lt;P class=""&gt;In our case, each&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;configuration becomes one task execution&lt;/STRONG&gt;, running preprocessing and training independently.&lt;/P&gt;&lt;P class=""&gt;Here’s how it looks in a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Databricks Asset Bundle (DAB)&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;YAML:&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;resources:&lt;/SPAN&gt;&lt;BR /&gt;  &lt;SPAN class=""&gt;jobs:&lt;/SPAN&gt;&lt;BR /&gt;    &lt;SPAN class=""&gt;training_job:&lt;/SPAN&gt;&lt;BR /&gt;      &lt;SPAN class=""&gt;name:&lt;/SPAN&gt; &lt;SPAN class=""&gt;Training_Job&lt;/SPAN&gt;&lt;BR /&gt;      &lt;SPAN class=""&gt;tasks:&lt;/SPAN&gt;&lt;BR /&gt;        &lt;SPAN class=""&gt;-&lt;/SPAN&gt; &lt;SPAN class=""&gt;task_key:&lt;/SPAN&gt; &lt;SPAN class=""&gt;LoadConfig&lt;/SPAN&gt;&lt;BR /&gt;          &lt;SPAN class=""&gt;notebook_task:&lt;/SPAN&gt;&lt;BR /&gt;            &lt;SPAN class=""&gt;notebook_path:&lt;/SPAN&gt; &lt;SPAN class=""&gt;./jobs/load_config.py&lt;/SPAN&gt;&lt;BR /&gt;        &lt;SPAN class=""&gt;-&lt;/SPAN&gt; &lt;SPAN class=""&gt;task_key:&lt;/SPAN&gt; &lt;SPAN class=""&gt;Training&lt;/SPAN&gt;&lt;BR /&gt;          &lt;SPAN class=""&gt;depends_on:&lt;/SPAN&gt;&lt;BR /&gt;            &lt;SPAN class=""&gt;-&lt;/SPAN&gt; &lt;SPAN class=""&gt;task_key:&lt;/SPAN&gt; &lt;SPAN class=""&gt;LoadConfig&lt;/SPAN&gt;&lt;BR /&gt;          &lt;SPAN class=""&gt;for_each_task:&lt;/SPAN&gt;&lt;BR 
/&gt;            &lt;SPAN class=""&gt;# concurrency, the nested task_key, and the model_config parameter name are illustrative&lt;/SPAN&gt;&lt;BR /&gt;            &lt;SPAN class=""&gt;inputs:&lt;/SPAN&gt; &lt;SPAN class=""&gt;"&lt;SPAN class=""&gt;{{tasks.LoadConfig.values.models}}&lt;/SPAN&gt;"&lt;/SPAN&gt;&lt;BR /&gt;            &lt;SPAN class=""&gt;concurrency:&lt;/SPAN&gt; &lt;SPAN class=""&gt;4&lt;/SPAN&gt;&lt;BR /&gt;            &lt;SPAN class=""&gt;task:&lt;/SPAN&gt;&lt;BR /&gt;              &lt;SPAN class=""&gt;task_key:&lt;/SPAN&gt; &lt;SPAN class=""&gt;TrainVariant&lt;/SPAN&gt;&lt;BR /&gt;              &lt;SPAN class=""&gt;notebook_task:&lt;/SPAN&gt;&lt;BR /&gt;                &lt;SPAN class=""&gt;notebook_path:&lt;/SPAN&gt; &lt;SPAN class=""&gt;./jobs/train.py&lt;/SPAN&gt;&lt;BR /&gt;                &lt;SPAN class=""&gt;base_parameters:&lt;/SPAN&gt;&lt;BR /&gt;                  &lt;SPAN class=""&gt;model_config:&lt;/SPAN&gt; &lt;SPAN class=""&gt;"&lt;SPAN class=""&gt;{{input}}&lt;/SPAN&gt;"&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;This enables:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Parallel execution&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;of each model/dataset variant&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Clean isolation&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;of logs and artifacts per run&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Scalability&lt;/STRONG&gt;: just add a new config, no new code needed&lt;/LI&gt;&lt;/UL&gt;&lt;H2 id="dac9"&gt;Databricks Asset Bundles (DAB)&lt;/H2&gt;&lt;P class=""&gt;With&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;DAB&lt;/STRONG&gt;, this entire workflow is versioned and deployed as code.&lt;/P&gt;&lt;P class=""&gt;DAB targets let us override&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;environment-specific parameters&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;without editing the core YAML or jobs.&lt;/P&gt;&lt;H2 id="54be"&gt;&lt;span class="lia-unicode-emoji" title=":bar_chart:"&gt;📊&lt;/span&gt; Before vs After: How Our Pipeline Evolved&lt;/H2&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="sandy311_1-1756395091510.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19451iDD86A7688BCD11F5/image-size/medium?v=v2&amp;amp;px=400" role="button" title="sandy311_1-1756395091510.png" alt="sandy311_1-1756395091510.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;H2 id="46c8"&gt;&lt;span class="lia-unicode-emoji" title=":wrapped_gift:"&gt;🎁&lt;/span&gt; Benefits&lt;/H2&gt;&lt;UL 
class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Onboarding in minutes&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;→ adding a new model or country is just a new YAML entry. No code changes.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;True parallelism&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;→ each config runs independently thanks to ForEach.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Strong validation&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;→ configs are enforced upfront with Pydantic.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Hands-free deployment&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;→ DAB takes care of multi-environment rollouts.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Full reproducibility&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;→ MLflow + Unity Catalog track everything for lineage and governance&lt;/LI&gt;&lt;/UL&gt;&lt;H2 id="cc20"&gt;&lt;span class="lia-unicode-emoji" title=":link:"&gt;🔗&lt;/span&gt; References&lt;/H2&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;A class="" href="https://docs.databricks.com/en/workflows/jobs/jobs.html" target="_blank" rel="noopener ugc nofollow"&gt;Databricks Workflows: Task orchestration and ForEach&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A class="" href="https://docs.databricks.com/en/dev-tools/bundles/index.html" target="_blank" rel="noopener ugc nofollow"&gt;Databricks Asset Bundles (DAB)&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A class="" href="https://docs.pydantic.dev/" target="_blank" rel="noopener ugc nofollow"&gt;Pydantic Documentation&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;span class="lia-unicode-emoji" title=":sparkles:"&gt;✨&lt;/span&gt; Thanks for reading, and I hope this gave you ideas for making your ML pipelines simpler, faster, and easier to scale!&lt;/P&gt;</description>
      <pubDate>Thu, 28 Aug 2025 15:32:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/parallel-model-training-amp-data-pipelines-on-databricks-foreach/m-p/130055#M623</guid>
      <dc:creator>sandy311</dc:creator>
      <dc:date>2025-08-28T15:32:30Z</dc:date>
    </item>
  </channel>
</rss>

