<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>article Pretraining Large Language Models with Databrick's AI Runtime in Technical Blog</title>
    <link>https://community.databricks.com/t5/technical-blog/pretraining-large-language-models-with-databrick-s-ai-runtime/ba-p/119771</link>
    <description>&lt;H1&gt;&lt;SPAN&gt;Introduction&lt;/SPAN&gt;&lt;/H1&gt;
&lt;P&gt;In this blog, we'll explore how to leverage Databricks’ latest innovation—AI Runtime—to efficiently pre-train large language models (LLMs). This guide is designed for ML developers and LLM engineers who need to deploy training runs on multi-GPU clusters. While we focus on a real pre-training use case, the same principles apply to fine-tuning workloads as well.&lt;/P&gt;
&lt;P&gt;AI Runtime simplifies access to high-performance GPU clusters, such as H100s, by handling GPU orchestration under the hood. Users can interact with these resources through two main entry points: &lt;EM&gt;Notebooks&lt;/EM&gt; and &lt;EM&gt;Jobs&lt;/EM&gt;. This blog will focus on the latter, demonstrating how to execute workloads using an optimized AI image preloaded with essential libraries like PyTorch, CUDA, and Composer.&lt;/P&gt;
&lt;P&gt;At Aimpoint Digital Labs, we successfully pre-trained a 1.5B parameter model on a single 8xH100 cluster—but these concepts are just as applicable to models of varying sizes.&lt;/P&gt;
&lt;P&gt;By the end of this tutorial, you'll be able to:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Set up your Databricks environment&lt;/LI&gt;
&lt;LI&gt;Initialize your LLM&lt;/LI&gt;
&lt;LI&gt;Customize Composer's Trainer with advanced settings&lt;/LI&gt;
&lt;LI&gt;Launch and manage Jobs in AI Runtime&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Let’s dive in!&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Why should I use Composer's Trainer?&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;Composer's Trainer removes the engineering burden that many AI researchers face when training LLMs. It enables fine-grained control over training through custom callbacks, loggers, events and more, while integrating out of the box with DeepSpeed and FSDP. It also offers handy features such as automatic batch size detection to maximize GPU usage. In fact, some notable LLMs like &lt;A href="https://arxiv.org/abs/2412.13663" target="_blank" rel="noopener"&gt;ModernBERT&lt;/A&gt;,&amp;nbsp;&lt;A href="https://arxiv.org/abs/2310.06694" target="_blank" rel="noopener"&gt;Sheared LLaMA&lt;/A&gt;&amp;nbsp;and&amp;nbsp;&lt;A href="https://arxiv.org/abs/2306.15006" target="_blank" rel="noopener"&gt;DNABERT-2&lt;/A&gt; have leveraged the Composer framework. For an overview of Composer’s Trainer check out our &lt;/P&gt;
&lt;H1&gt;&lt;SPAN&gt;Setup&lt;/SPAN&gt;&lt;/H1&gt;
&lt;P&gt;As we will be processing massive volumes of data, and we don't want data loading to slow down our training, we will use Mosaic's Streaming Dataset, which first requires converting the dataset to&amp;nbsp;&lt;A href="https://docs.mosaicml.com/projects/streaming/en/latest/preparing_datasets/dataset_format.html#mds" target="_blank" rel="noopener"&gt;Mosaic Data Shards (MDS)&lt;/A&gt;&amp;nbsp;(the "&lt;EM&gt;most performant file format for fast sample random-access&lt;/EM&gt;"). You can also skip this step and use a standard dataset, as we will cover both options.&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Requirements&lt;/SPAN&gt;&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;Add an external location to your Databricks workspace. For setup, follow this&amp;nbsp;&lt;A href="https://docs.databricks.com/en/connect/unity-catalog/cloud-storage/external-locations.html#:~:text=To%20assign%20an%20external%20location,MANAGE%20on%20the%20external%20location." target="_blank" rel="noopener"&gt;documentation&lt;/A&gt;.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;&lt;SPAN&gt;Connecting your data bucket&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;Here we present an option for connecting our data so that it is accessible to our training script. It is useful whether you have an MDS dataset or are using your own dataset format. It is &lt;SPAN&gt;not &lt;/SPAN&gt;necessary if you plan to download the dataset for every experiment you run (e.g. using Hugging Face's &lt;SPAN&gt;datasets.load_dataset&lt;/SPAN&gt; to download from the Hugging Face Hub).&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;If you already have a &lt;SPAN&gt;c&lt;/SPAN&gt;atalog and schema to store your bucket in, skip to step 3.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Steps:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;From your Databricks &lt;SPAN&gt;workspace&lt;/SPAN&gt;, &lt;SPAN&gt;navigate&lt;/SPAN&gt; to the &lt;EM&gt;'Catalog'&lt;/EM&gt; page, which can be found in the left-hand menu&lt;/LI&gt;
&lt;LI&gt;Choose or create an appropriate catalog in Databricks. Catalogs are the highest level in Databricks' three-level namespace.&lt;/LI&gt;
&lt;LI&gt;Choose or create an appropriate schema, where we will add the external volume with our dataset. A schema is the second layer of &lt;SPAN&gt;the &lt;/SPAN&gt;namespace.
&lt;OL&gt;
&lt;LI&gt;To create a schema, click on a catalog, and click on the right-hand button that says 'Create Schema'&lt;/LI&gt;
&lt;LI&gt;Give a name for your schema and leave the external location blank if you want the volume to be managed by Databricks&lt;/LI&gt;
&lt;/OL&gt;
&lt;/LI&gt;
&lt;LI&gt;Click on the Schema you want to use, and you'll see a &lt;EM&gt;'Create'&lt;/EM&gt; button on the top &lt;SPAN&gt;right.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="amcclendon_14-1747750517543.png" style="width: 663px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16989i5037AECDCB41A9E4/image-dimensions/663x382?v=v2" width="663" height="382" role="button" title="amcclendon_14-1747750517543.png" alt="amcclendon_14-1747750517543.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;5.&amp;nbsp;Select the option '&lt;EM&gt;Create a Volume&lt;/EM&gt;', give your volume a name so we can use it from the training script, and choose '&lt;EM&gt;External volume&lt;/EM&gt;' as volume type, to point to the external location where your dataset &lt;SPAN&gt;is.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="amcclendon_15-1747750517554.png" style="width: 660px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16990i31035F583E3A487B/image-dimensions/660x670?v=v2" width="660" height="670" role="button" title="amcclendon_15-1747750517554.png" alt="amcclendon_15-1747750517554.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Setting up your repo or code folder&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;We will set up our code in our workspace. To access your workspace:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Go to the&amp;nbsp;‘&lt;EM&gt;Workspace’&lt;/EM&gt;&amp;nbsp;tab from the left-hand &lt;SPAN&gt;menu.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;Follow your organization's guidelines on where to create your code. I will create mine in my&amp;nbsp;&lt;EM&gt;Home&lt;/EM&gt;&amp;nbsp;folder.&lt;/LI&gt;
&lt;LI&gt;Click on the folder where we will store the code for this tutorial, and create a Folder or Git Folder.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="amcclendon_16-1747750517573.png" style="width: 630px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16991i30274798604CE542/image-dimensions/630x240?v=v2" width="630" height="240" role="button" title="amcclendon_16-1747750517573.png" alt="amcclendon_16-1747750517573.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;If you want version control, you may need to connect Databricks to your Git provider first.&lt;/P&gt;
&lt;H1&gt;&lt;SPAN&gt;Using Composer's Trainer&lt;/SPAN&gt;&lt;/H1&gt;
&lt;P&gt;To leverage AI Runtime, we are going to use the MosaicML Composer Trainer.&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Creating a composer model&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;In order to use &lt;SPAN&gt;M&lt;/SPAN&gt;osaic's &lt;SPAN&gt;C&lt;/SPAN&gt;omposer, we will need to instantiate a&amp;nbsp;ComposerModel. If you&lt;SPAN&gt; are&lt;/SPAN&gt; using a Hugging Face transformer model, pass it to the&amp;nbsp;HuggingFaceModel&amp;nbsp;class as follows&lt;SPAN&gt;:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;A) Using a Hugging Face Transformer&lt;/STRONG&gt;&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;from transformers import AutoModelForCausalLM
from composer.models import HuggingFaceModel
 
# huggingface model; a causal-LM head is needed so the model returns a loss
model = AutoModelForCausalLM.from_pretrained('mistralai/Mistral-Nemo-Base-2407')
 
# composer model, ready to be passed to our trainer
composer_model = HuggingFaceModel(model)
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&lt;STRONG&gt;B) Using a custom model&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Our custom model can be anything, as long as it subclasses ComposerModel and implements&amp;nbsp;&lt;SPAN&gt;loss()&lt;/SPAN&gt;&amp;nbsp;and&amp;nbsp;&lt;SPAN&gt;forward()&lt;/SPAN&gt;. The trainer takes care of calling:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
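&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;x.to(device), y.to(device)  # move the batch to the right device
optimizer.zero_grad()       # clear gradients left over from the previous step
loss.backward()             # backpropagate
optimizer.step()            # update the weights&lt;/LI-CODE&gt;&lt;/TD&gt;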
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;Here&lt;SPAN&gt; is &lt;/SPAN&gt;an example implementation of a composer-compatible custom model:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;import torchvision
import torch.nn.functional as F
 
from composer.models import ComposerModel
 
class ResNet18(ComposerModel):
 
    def __init__(self):
        super().__init__()
        self.model = torchvision.models.resnet18()
 
    def forward(self, batch): # batch is the output of the dataloader
        # specify how batches are passed through the model
        inputs, _ = batch
        return self.model(inputs)
 
    def loss(self, outputs, batch):
        # pass batches and `forward` outputs to the loss
        _, targets = batch
        return F.cross_entropy(outputs, targets)
 
composer_model = ResNet18()
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
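&lt;P&gt;To make the contract concrete, here is a minimal pure-Python sketch of the two calls a trainer makes on every batch. It does not use Composer or PyTorch: &lt;SPAN&gt;TinyLinearModel&lt;/SPAN&gt; and &lt;SPAN&gt;training_step&lt;/SPAN&gt; are hypothetical names invented for this illustration, and the real trainer additionally handles device placement, the backward pass and the optimizer step.&lt;/P&gt;

```python
class TinyLinearModel:
    """Pure-Python stand-in for a ComposerModel: y = weight * x."""

    def __init__(self, weight):
        self.weight = weight

    def forward(self, batch):
        # batch is (inputs, targets), i.e. whatever the dataloader yields
        inputs, _ = batch
        return [self.weight * x for x in inputs]

    def loss(self, outputs, batch):
        # mean squared error between the predictions and the targets
        _, targets = batch
        return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)


def training_step(model, batch):
    """Hypothetical helper: forward() first, then loss() on its outputs."""
    outputs = model.forward(batch)
    return model.loss(outputs, batch)


batch = ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(training_step(TinyLinearModel(weight=2.0), batch))  # 0.0
print(training_step(TinyLinearModel(weight=1.0), batch))
```

&lt;P&gt;The point is only the calling order: whatever &lt;SPAN&gt;forward()&lt;/SPAN&gt; returns is handed back to &lt;SPAN&gt;loss()&lt;/SPAN&gt; together with the same batch.&lt;/P&gt;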
&lt;H2&gt;&lt;SPAN&gt;Optimizer and Learning Rate Scheduler&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;We initialize our optimizer and learning rate scheduler as usual. Composer supports any torch optimizer and scheduler, and it also ships its own scheduler implementations, such as &lt;SPAN&gt;StepScheduler&lt;/SPAN&gt;, &lt;SPAN&gt;MultiStepScheduler&lt;/SPAN&gt; and &lt;SPAN&gt;ExponentialScheduler&lt;/SPAN&gt;. See the &lt;A href="https://docs.mosaicml.com/projects/composer/en/stable/trainer/schedulers.html" target="_blank" rel="noopener"&gt;documentation&lt;/A&gt; for the complete list. Here we use Composer's &lt;SPAN&gt;LinearWithWarmupScheduler&lt;/SPAN&gt;.&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
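&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;import torch
from composer.optim import LinearWithWarmupScheduler
 
# use composer_model.model.parameters() instead if using a custom model as shown above
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
 
lr_scheduler = LinearWithWarmupScheduler(
    t_warmup="1ep",  # warm up over the first epoch
    alpha_i=1.0,     # LR multiplier right after warmup
    alpha_f=1.0      # LR multiplier at the end of training (1.0 means no decay)
)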
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
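&lt;P&gt;To build intuition for what the &lt;SPAN&gt;alpha_i&lt;/SPAN&gt; and &lt;SPAN&gt;alpha_f&lt;/SPAN&gt; arguments do, the sketch below computes a learning rate multiplier by hand: a linear ramp during warmup, then a linear move from &lt;SPAN&gt;alpha_i&lt;/SPAN&gt; to &lt;SPAN&gt;alpha_f&lt;/SPAN&gt;. This is a simplified illustration and not Composer's implementation. The function name is hypothetical, and time is counted in plain batch steps rather than Composer's time strings.&lt;/P&gt;

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps, alpha_i=1.0, alpha_f=0.0):
    """Simplified LR multiplier: linear warmup, then linear interpolation."""
    if warmup_steps > step:
        # warmup phase: ramp linearly from 0 up to alpha_i
        return alpha_i * step / warmup_steps
    if warmup_steps >= total_steps:
        # nothing left after warmup, so stay at alpha_i
        return alpha_i
    # after warmup: move linearly from alpha_i to alpha_f
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return alpha_i + progress * (alpha_f - alpha_i)


# With alpha_i = alpha_f = 1.0 the LR stays flat once warmup has finished.
for step in (0, 50, 100, 1000):
    print(step, linear_warmup_multiplier(step, 100, 1000, alpha_i=1.0, alpha_f=1.0))
```

&lt;P&gt;The effective learning rate is the optimizer's base rate times this multiplier, so setting &lt;SPAN&gt;alpha_f&lt;/SPAN&gt; below 1.0 would add a linear decay after warmup.&lt;/P&gt;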
&lt;H2&gt;&lt;SPAN&gt;Dataset&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;To load our dataset when working in a distributed environment, we first need to initialize the process group. The Composer stack will take care of distributing the data shards appropriately. We are using the “nccl” backend, which NVIDIA designed and optimized for GPU-to-GPU communication.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;&lt;SPAN&gt;This step should be skipped if only using 1 machine&lt;/SPAN&gt;&lt;/EM&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;import os
 
import torch
from torch.distributed import init_process_group
 
init_process_group(backend="nccl")
# LOCAL_RANK is this process's GPU index on its node; RANK and WORLD_SIZE hold the global rank and size
device = f"cuda:{os.environ['LOCAL_RANK']}"
torch.cuda.set_device(device)
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
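&lt;P&gt;If you want the same script to also run as a single process (for example while debugging), one option is to read the launcher's environment variables with defaults. The following is a minimal sketch using only the standard library. &lt;SPAN&gt;read_distributed_env&lt;/SPAN&gt; is a hypothetical helper, and the example environment is made up.&lt;/P&gt;

```python
import os


def read_distributed_env(env=None):
    """Return local rank, global rank and world size, with single-process defaults."""
    env = os.environ if env is None else env
    return {
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "rank": int(env.get("RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


# Example: the values that rank 3 would see on a single 8-GPU node.
settings = read_distributed_env({"LOCAL_RANK": "3", "RANK": "3", "WORLD_SIZE": "8"})
device = f"cuda:{settings['local_rank']}"
print(device)  # cuda:3
```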
&lt;P&gt;Note that we also set &lt;SPAN&gt;torch.cuda&lt;/SPAN&gt;’s device when initializing the process group, as we are in a distributed setting. Now we can create our dataset and dataloader. We point the &lt;SPAN&gt;StreamingDataset&lt;/SPAN&gt; to the external volume we created before, using the following syntax:&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;"dbfs:/Volumes/{YOUR_CATALOG_NAME}/{YOUR_SCHEMA_NAME}/{YOUR_EXTERNAL_VOLUME_NAME}/{PATH_INSIDE_YOUR_BUCKET}"&lt;/SPAN&gt;&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;from streaming import StreamingDataset
from torch.utils.data import DataLoader
 
train_dataset = StreamingDataset(
    remote="dbfs:/Volumes/{YOUR_CATALOG_NAME}/{YOUR_SCHEMA_NAME}/{YOUR_EXTERNAL_VOLUME_NAME}/{PATH_INSIDE_YOUR_BUCKET}",
    shuffle=False,
    batch_size=60,
)
 
# StreamingDataset is iterable, so shuffling is configured on the dataset, not on the DataLoader
train_dataloader = DataLoader(train_dataset, batch_size=60)
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
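&lt;P&gt;Since the remote path is just the catalog, schema and volume names joined together, it can be convenient to assemble it with a small helper instead of editing the string by hand. This is a sketch only: &lt;SPAN&gt;volume_path&lt;/SPAN&gt; is a hypothetical function, and the names in the example are placeholders rather than real resources.&lt;/P&gt;

```python
def volume_path(catalog, schema, volume, path_inside_bucket=""):
    """Build a dbfs:/Volumes/... path from its parts (hypothetical helper)."""
    names = [catalog, schema, volume]
    for name in names:
        if not name or "/" in name:
            raise ValueError(f"invalid name: {name!r}")
    # drop empty segments so leading or trailing slashes do not matter
    sub_parts = [part for part in path_inside_bucket.split("/") if part]
    return "dbfs:/Volumes/" + "/".join(names + sub_parts)


print(volume_path("main", "llm", "pretraining_data", "mds/train"))
# dbfs:/Volumes/main/llm/pretraining_data/mds/train
```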
&lt;H3&gt;&lt;SPAN&gt;Note for fine-tuning users (the non-StreamingDataset approach)&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;If you are working with a smaller dataset for fine-tuning, setting up a &lt;SPAN&gt;StreamingDataset&lt;/SPAN&gt; might be unnecessary overhead. In this case, I recommend a simpler approach. First, load your dataset as usual, for example with Hugging Face’s &lt;SPAN&gt;load_dataset&lt;/SPAN&gt; function, which supports datasets from the Hugging Face Hub or local storage. Next, tokenize your dataset and save it in its tokenized form to avoid redundant tokenization every time you start training. Finally, initialize a &lt;SPAN&gt;DataLoader&lt;/SPAN&gt;, just as we did earlier, and pass it to the trainer. This step is fully customizable as long as your &lt;SPAN&gt;DataLoader&lt;/SPAN&gt; outputs data in a format compatible with your model’s forward pass.&lt;/P&gt;
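&lt;P&gt;The "tokenize once, then reuse" idea can be sketched with the standard library alone. In the example below, &lt;SPAN&gt;load_or_tokenize&lt;/SPAN&gt; and &lt;SPAN&gt;whitespace_tokenize&lt;/SPAN&gt; are hypothetical names, the whitespace split stands in for a real tokenizer, and JSON stands in for whatever on-disk format you prefer.&lt;/P&gt;

```python
import json
import tempfile
from pathlib import Path


def load_or_tokenize(texts, tokenize, cache_path):
    """Return tokenized texts, reading them from cache_path when it already exists."""
    cache_file = Path(cache_path)
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    tokenized = [tokenize(text) for text in texts]
    cache_file.write_text(json.dumps(tokenized))
    return tokenized


def whitespace_tokenize(text):
    # stand-in for a real tokenizer
    return text.split()


with tempfile.TemporaryDirectory() as tmp_dir:
    cache = Path(tmp_dir) / "tokenized.json"
    first = load_or_tokenize(["hello world"], whitespace_tokenize, cache)   # tokenizes and writes
    second = load_or_tokenize(["hello world"], whitespace_tokenize, cache)  # reads the cache
    print(first == second)  # True
```

&lt;P&gt;Note that this sketch never invalidates the cache, so you would need to delete the file (or key it on the tokenizer and dataset version) whenever either one changes.&lt;/P&gt;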
&lt;H2&gt;&lt;SPAN&gt;Other recommended Features&lt;/SPAN&gt;&lt;/H2&gt;
&lt;H3&gt;1.&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN&gt;Loggers&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;Composer’s trainer takes a &lt;SPAN&gt;loggers&lt;/SPAN&gt; argument where we can pass a &lt;SPAN&gt;&lt;A href="https://docs.mosaicml.com/projects/composer/en/stable/trainer/logging.html" target="_blank" rel="noopener"&gt;variety of loggers,&lt;/A&gt;&lt;/SPAN&gt; including most popular experiment trackers. Here we are using the `&lt;SPAN&gt;MLFlowLogger&lt;/SPAN&gt;`, as it integrates nicely with Databricks.&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;from composer.loggers import MLFlowLogger
from datetime import datetime
 
loggers = MLFlowLogger(
    experiment_name="LLM_pretraining", 
    run_name= datetime.now().strftime("%Y-%m-%d-%H-%M"),
    model_registry_uri="databricks"
)
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;You can set loggers to a list if you want to use several loggers at once. The &lt;SPAN&gt;MLFlowLogger&lt;/SPAN&gt; also logs our model checkpoints as an experiment artifact, for later use.&lt;/P&gt;
&lt;H3&gt;2.&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN&gt;Auto batch size&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;Finding the batch size that maximizes GPU usage is a tedious manual task, which normally involves iteratively adjusting the gradient accumulation steps, the global batch size and the microbatch size. Composer’s trainer handles this for you with the setting &lt;SPAN&gt;device_train_microbatch_size = "auto"&lt;/SPAN&gt;, which makes the trainer find the microbatch size that maximizes GPU utilization for your particular model and training run.&lt;/P&gt;
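&lt;P&gt;To see how these quantities relate, here is a small sketch of the arithmetic together with a naive halving search. It is an illustration under simplifying assumptions and not Composer's actual algorithm. Both function names are hypothetical, and the memory check is simulated with a made-up limit instead of a real out-of-memory error.&lt;/P&gt;

```python
import math


def grad_accumulation_steps(device_batch_size, microbatch_size):
    """Number of microbatches needed to cover one per-device batch."""
    return math.ceil(device_batch_size / microbatch_size)


def find_microbatch_size(device_batch_size, fits_in_memory):
    """Start from the full per-device batch and halve until it fits."""
    size = device_batch_size
    while size > 1 and not fits_in_memory(size):
        size = max(1, size // 2)
    return size


# Pretend that at most 16 samples fit on the GPU at once (made-up limit).
microbatch = find_microbatch_size(60, lambda size: 16 >= size)
print(microbatch, grad_accumulation_steps(60, microbatch))  # 15 4
```

&lt;P&gt;The per-device batch size stays the same from the optimizer's point of view. Only the number of forward and backward passes per optimizer step changes.&lt;/P&gt;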
&lt;H3&gt;3.&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN&gt;Speeding up training (DeepSpeed, FSDP and Torch Compile)&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;We can speed up training by using popular frameworks like FSDP and torch compile. In our runs, torch compile provided a slight speed increase but required some tinkering to get working, so I recommend FSDP for most use cases. This is an example of an FSDP configuration:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;fsdp_config = {
    'activation_checkpointing': False,
    'activation_checkpointing_reentrant': True, # Only matters if checkpointing is True, but leaving default
    'activation_cpu_offload': False,
    'backward_prefetch': 'BACKWARD_PRE',
    'forward_prefetch': True,
    'cpu_offload': False,
    #'mixed_precision': 'PURE', # More aggressive precision reduction, can improve speed if stable
    'sharding_strategy': 'SHARD_GRAD_OP', # Shard gradients and optimizer states, may reduce overhead
    'sync_module_states': False, # Skip initial state sync for a slight performance gain
    'use_orig_params': True,
    'verbose': False,
}
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
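&lt;P&gt;To get a feel for what the sharding strategy changes, the sketch below gives a back-of-envelope estimate of the per-GPU memory taken by model states. It assumes the accounting commonly used for mixed-precision Adam (2 bytes per parameter for weights, 2 for gradients and 12 for optimizer state), ignores activations and buffers entirely, and treats &lt;SPAN&gt;SHARD_GRAD_OP&lt;/SPAN&gt; as keeping a full copy of the weights on every GPU. The function name is hypothetical and the numbers are rough estimates, not measurements from our run.&lt;/P&gt;

```python
def per_gpu_model_state_gb(num_params, num_gpus, strategy):
    """Rough per-GPU memory (in GB) for weights, gradients and optimizer state."""
    param_bytes = 2 * num_params    # half-precision weights
    grad_bytes = 2 * num_params     # half-precision gradients
    optim_bytes = 12 * num_params   # fp32 weight copy plus two Adam moments
    if strategy == "NO_SHARD":
        total = param_bytes + grad_bytes + optim_bytes
    elif strategy == "SHARD_GRAD_OP":
        total = param_bytes + (grad_bytes + optim_bytes) / num_gpus
    elif strategy == "FULL_SHARD":
        total = (param_bytes + grad_bytes + optim_bytes) / num_gpus
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return total / 1e9


# A 1.5B parameter model on 8 GPUs, as in our pre-training run.
for strategy in ("NO_SHARD", "SHARD_GRAD_OP", "FULL_SHARD"):
    print(strategy, per_gpu_model_state_gb(1_500_000_000, 8, strategy))
```

&lt;P&gt;Under these assumptions, &lt;SPAN&gt;SHARD_GRAD_OP&lt;/SPAN&gt; sits between no sharding and full sharding in memory use, in exchange for less communication than &lt;SPAN&gt;FULL_SHARD&lt;/SPAN&gt;.&lt;/P&gt;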
&lt;P&gt;&lt;SPAN&gt;L&lt;/SPAN&gt;aunching a training job with FSDP is as simple as passing this config to the trainer.&lt;/P&gt;
&lt;P&gt;We can also specify configurations for torch compile:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;compile_config = {
    'mode': 'default',
    'dynamic': True
}&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;H3&gt;4.&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN&gt;Checkpointing&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;To enable checkpointing, we need to pass an MLFlowLogger with the option &lt;SPAN&gt;model_registry_uri="databricks"&lt;/SPAN&gt;, and pass the kwargs `&lt;SPAN&gt;save_interval&lt;/SPAN&gt;` and `&lt;SPAN&gt;save_folder&lt;/SPAN&gt;` to the trainer. For this to work, your code must be wrapped in an `&lt;SPAN&gt;if __name__ == "__main__":&lt;/SPAN&gt;` block. I recommend setting:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;save_folder = "dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts/checkpoints"&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;The variable &lt;SPAN&gt;save_interval&lt;/SPAN&gt; should be a string with an integer followed by &lt;SPAN&gt;‘&lt;/SPAN&gt;ba&lt;SPAN&gt;’&lt;/SPAN&gt; or &lt;SPAN&gt;‘&lt;/SPAN&gt;ep&lt;SPAN&gt;’&lt;/SPAN&gt; for batches or epochs, respectively. For example, &lt;SPAN&gt;"2000ba"&lt;/SPAN&gt; corresponds to checkpointing every 2000 batches.&lt;/P&gt;
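&lt;P&gt;The two checkpoint settings are plain strings, so it is easy to sanity-check them before launching a long run. The sketch below validates an interval of the form described above and shows what the folder template looks like once its placeholders are filled in. Both helpers are hypothetical, the IDs are made up, and Composer does its own parsing and substitution at runtime, so this is only meant as a pre-flight check.&lt;/P&gt;

```python
import re


def parse_save_interval(interval):
    """Split a string such as '2000ba' into (2000, 'ba')."""
    match = re.fullmatch(r"(\d+)(ba|ep)", interval)
    if match is None:
        raise ValueError(f"expected an integer followed by 'ba' or 'ep', got {interval!r}")
    return int(match.group(1)), match.group(2)


def fill_save_folder(template, experiment_id, run_id):
    """Fill the MLflow placeholders of a save_folder template."""
    return template.format(mlflow_experiment_id=experiment_id, mlflow_run_id=run_id)


print(parse_save_interval("2000ba"))  # (2000, 'ba')

template = "dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts/checkpoints"
print(fill_save_folder(template, "1234", "abcd"))
# dbfs:/databricks/mlflow-tracking/1234/abcd/artifacts/checkpoints
```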
&lt;P&gt;&lt;EM&gt;If you will only use the final checkpoint, I recommend not setting `&lt;/EM&gt;&lt;SPAN&gt;save_interval&lt;/SPAN&gt;&lt;EM&gt;` nor `&lt;/EM&gt;&lt;SPAN&gt;save_folder&lt;/SPAN&gt;&lt;EM&gt;` to speed up training.&lt;/EM&gt;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Finally: the trainer&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;Gathering everything we have seen so far, creating a trainer is as simple as:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;import composer
 
trainer_args = {
    "model": composer_model,
    "train_dataloader": train_dataloader,
    "max_duration": "1ep",
    "optimizers": optimizer,
    "schedulers": lr_scheduler,
    "step_schedulers_every_batch": True,
    "device": "gpu", # Composer expects "gpu" or "cpu" here, not a torch device string such as "cuda:0"
    "loggers": loggers,
    "device_train_microbatch_size": "auto",
    "save_folder": "dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts/checkpoints",
    "save_interval": "2000ba", # Can end in ep for epochs, or ba for batches
    "parallelism_config": {"fsdp":fsdp_config},
    "compile_config": {
        'mode': 'default',
        'dynamic': True
    },
}
 
trainer = composer.trainer.Trainer(
    **trainer_args
)
 
trainer.fit() 
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;Then, `&lt;SPAN&gt;trainer.fit()&lt;/SPAN&gt;` launches the training run.&lt;/P&gt;
&lt;P&gt;The whole script is:&lt;/P&gt;
&lt;TABLE style="width: 100%; border-style: hidden;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;import os
from datetime import datetime
 
import composer
from composer.loggers import MLFlowLogger
from composer.models import HuggingFaceModel
from streaming import StreamingDataset
 
import torch
from torch.distributed import init_process_group
from torch.utils.data import DataLoader
 
from transformers import AutoModelForCausalLM
 
if __name__ == "__main__":
    # initialize the process group first, so that each process uses its own GPU
    init_process_group(backend="nccl")
    device = f"cuda:{os.environ['LOCAL_RANK']}"
    torch.cuda.set_device(device)
 
    # huggingface model; a causal-LM head is needed so the model returns a loss
    model = AutoModelForCausalLM.from_pretrained('mistralai/Mistral-Nemo-Base-2407').to(device)
 
    # composer model, ready to be passed to our trainer
    composer_model = HuggingFaceModel(model)
 
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
 
    lr_scheduler = composer.optim.LinearWithWarmupScheduler(
        t_warmup="1ep",
        alpha_i=1.0,
        alpha_f=1.0
    )
 
    train_dataset = StreamingDataset(remote="YOUR_REMOTE_PATH", shuffle=False, batch_size=60)
    train_dataloader = DataLoader(train_dataset, batch_size=60)
 
    loggers = MLFlowLogger(
        experiment_name="LLM_pretraining",
        run_name=datetime.now().strftime("%Y-%m-%d-%H-%M"),
        model_registry_uri="databricks"
    )
 
    fsdp_config = {
        'activation_checkpointing': False,
        'activation_checkpointing_reentrant': True, # Only matters if checkpointing is True, but leaving default
        'activation_cpu_offload': False,
        'backward_prefetch': 'BACKWARD_PRE',
        'forward_prefetch': True,
        'cpu_offload': False,
        #'mixed_precision': 'PURE', # More aggressive precision reduction, can improve speed if stable
        'sharding_strategy': 'SHARD_GRAD_OP', # Shard gradients and optimizer states, may reduce overhead
        'sync_module_states': False, # Skip initial state sync for a slight performance gain
        'use_orig_params': True,
        'verbose': False,
    }
 
    trainer_args = {
        "model": composer_model,
        "train_dataloader": train_dataloader,
        "max_duration": "1ep",
        "optimizers": optimizer,
        "schedulers": lr_scheduler,
        "step_schedulers_every_batch": True,
        "device": "gpu", # Composer expects "gpu" or "cpu" here, not a torch device string
        "loggers": loggers,
        "device_train_microbatch_size": "auto",
        "save_folder": "dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts/checkpoints",
        "save_interval": "2000ba", # Can end in ep for epochs, or ba for batches
        "parallelism_config": {"fsdp": fsdp_config},
        "compile_config": {
            'mode': 'default',
            'dynamic': True
        },
    }
 
    trainer = composer.trainer.Trainer(
        **trainer_args
    )
 
    trainer.fit()
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;H1&gt;AI Runtime&lt;/H1&gt;
&lt;H2&gt;&lt;SPAN&gt;GPU Pool&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;You will first need access to a GPU pool. You can find your GPU pools under Compute (left-hand menu) &amp;gt; GPU Pools.&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Launching jobs&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;An AI Runtime job requires a shell script to execute.&lt;/P&gt;
&lt;H4&gt;&lt;SPAN&gt;1. Creating the Shell Script&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;First, create a simple launch.sh script anywhere in your workspace. I placed mine in the same folder as &lt;SPAN&gt;train.py&lt;/SPAN&gt;:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;#!/bin/bash

composer train.py
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;H4&gt;&lt;SPAN&gt;2. Using a YAML Configuration File (Optional but Recommended)&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;A powerful feature you can use is a configuration YAML file. This allows you to define parameters dynamically instead of hardcoding them in your script. To enable this, create a &lt;SPAN&gt;config.yaml&lt;/SPAN&gt; file—this can be empty for now. A sample configuration might look like this:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;batch_size: 60
lr: 1e-5
model: "mistralai/Mistral-Nemo-Base-2407"
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;The location of this YAML file will be stored in the &lt;SPAN&gt;PARAMETERS&lt;/SPAN&gt; environment variable. You can then load the configuration in your training script like this:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
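&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;import os
import yaml
 
# PARAMETERS holds the path to the YAML configuration file
with open(os.environ["PARAMETERS"]) as f:
    config = yaml.safe_load(f) or {}  # an empty file loads as None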
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;Now, instead of hardcoding values, you can reference them dynamically. For example, replace:&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;batch_size = 60&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;with:&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;batch_size = config.get("batch_size", 60)&amp;nbsp; # Defaults to 60 if not found&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;This approach makes your training workflow more flexible and easier to adjust without touching the code.&lt;/P&gt;
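&lt;P&gt;One detail worth guarding against: depending on the YAML parser, a value written as &lt;SPAN&gt;1e-5&lt;/SPAN&gt; (without a decimal point) may be read as a string rather than a float. PyYAML behaves this way, for example. A minimal sketch of merging the loaded configuration with defaults and coercing types could look like the following, where &lt;SPAN&gt;DEFAULTS&lt;/SPAN&gt; and &lt;SPAN&gt;resolve_hyperparameters&lt;/SPAN&gt; are hypothetical names and the input dictionary simulates what the parser returned.&lt;/P&gt;

```python
DEFAULTS = {
    "batch_size": 60,
    "lr": 1e-5,
    "model": "mistralai/Mistral-Nemo-Base-2407",
}


def resolve_hyperparameters(config):
    """Overlay the loaded config on the defaults and coerce each value's type."""
    merged = {**DEFAULTS, **(config or {})}
    return {
        "batch_size": int(merged["batch_size"]),
        "lr": float(merged["lr"]),
        "model": str(merged["model"]),
    }


# Simulate a parsed config in which the learning rate arrived as a string.
params = resolve_hyperparameters({"lr": "1e-5", "batch_size": 32})
print(params["lr"] * 2, params["batch_size"])
```

&lt;P&gt;Coercing once, right after loading, keeps the rest of the training script free of type checks. Writing the value as &lt;SPAN&gt;1.0e-5&lt;/SPAN&gt; in the YAML file is another way to avoid the issue.&lt;/P&gt;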
&lt;H4&gt;&lt;SPAN&gt;3. Creating a workflow&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;Now, &lt;SPAN&gt;let’s&lt;/SPAN&gt; create our LLM training job. Go to the ‘&lt;EM&gt;Workflows’&lt;/EM&gt; section on the left hand &lt;SPAN&gt;menu and&lt;/SPAN&gt; create a job.&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="amcclendon_17-1747750517578.png" style="width: 645px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16994i5430AA478BAF1140/image-dimensions/645x345?v=v2" width="645" height="345" role="button" title="amcclendon_17-1747750517578.png" alt="amcclendon_17-1747750517578.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Fill in your relevant fields:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="amcclendon_18-1747750517586.png" style="width: 626px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16992iD5B490954247207F/image-dimensions/626x335?v=v2" width="626" height="335" role="button" title="amcclendon_18-1747750517586.png" alt="amcclendon_18-1747750517586.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Once you have created a job, the job list overview for your task might look like:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="amcclendon_19-1747750517596.png" style="width: 644px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16993i5CC00E7B10B7B499/image-dimensions/644x359?v=v2" width="644" height="359" role="button" title="amcclendon_19-1747750517596.png" alt="amcclendon_19-1747750517596.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;From this screen, click on ‘&lt;EM&gt;Run now&lt;/EM&gt;’ to launch your training job. This will queue your job and execute it as soon as your compute is ready. If you notice it doesn’t launch immediately, it could be that your compute is being used for &lt;/SPAN&gt;&lt;SPAN&gt;another task. You can see what is being executed in your GPU pool if you head to “Compute &amp;gt; GPU Pools” and find your compute.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;We’ll now go over some handy features that AI Runtime offers to monitor your training job.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;AI Runtime logs and metrics&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;When we create a job in AI Runtime, an MLflow experiment is automatically created for us. It holds information about the training run, including model metrics, system metrics, and artifacts like model checkpoints. &lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;There are two ways we can find our experiment:&lt;/SPAN&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN&gt;Recommended&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt;: Workflows (tab in left hand menu) &amp;gt; Click on the task we just created &amp;gt; Click on your running job (easy to find, as they are sorted chronologically, and have a ‘&lt;EM&gt;status’&lt;/EM&gt; field) &amp;gt; In the ‘&lt;EM&gt;Training Output’&lt;/EM&gt; table, click on either ‘&lt;EM&gt;MLflow Run’&lt;/EM&gt; or ‘&lt;EM&gt;Detailed Logs’&lt;/EM&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Experiments (tab in left hand menu) &amp;gt; Click on your project (will be named something like: &lt;/SPAN&gt;&lt;SPAN&gt;AiTrainingTask-YOUR_TASK_NAME&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN&gt;It might be hard to find your run, as these experiments don’t have much metadata&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;SPAN&gt;Once you are in the MLflow run, there are several useful features for inspecting your run.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN&gt;Artifacts&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;Under the ‘&lt;EM&gt;Artifacts’&lt;/EM&gt; tab, you’ll find a ‘&lt;EM&gt;logs’&lt;/EM&gt; folder that contains log files with the stdout and stderr from both the training script and all the GPU workers. In the files named “&lt;/SPAN&gt;&lt;SPAN&gt;logs-n.chunk.txt&lt;/SPAN&gt;&lt;SPAN&gt;” you should see all the print statements from your Python script, as well as the traceback if an exception stops the execution. &lt;/SPAN&gt;&lt;/P&gt;
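&lt;P&gt;If you download several of these log files and want to read them in order, keep in mind that a plain alphabetical sort would put chunk 10 before chunk 2. The sketch below sorts them numerically instead. It assumes that the n in the file name is an integer chunk index, which is my reading of the naming pattern rather than documented behavior, and &lt;SPAN&gt;sort_log_chunks&lt;/SPAN&gt; is a hypothetical helper.&lt;/P&gt;

```python
import re


def sort_log_chunks(filenames):
    """Sort names like 'logs-3.chunk.txt' by their numeric chunk index."""

    def chunk_index(name):
        match = re.fullmatch(r"logs-(\d+)\.chunk\.txt", name)
        if match is None:
            raise ValueError(f"unexpected log file name: {name!r}")
        return int(match.group(1))

    return sorted(filenames, key=chunk_index)


print(sort_log_chunks(["logs-10.chunk.txt", "logs-2.chunk.txt", "logs-1.chunk.txt"]))
# ['logs-1.chunk.txt', 'logs-2.chunk.txt', 'logs-10.chunk.txt']
```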
&lt;P&gt;&lt;SPAN&gt;If you have passed a `&lt;/SPAN&gt;&lt;SPAN&gt;save_folder&lt;/SPAN&gt;&lt;SPAN&gt;` and `&lt;/SPAN&gt;&lt;SPAN&gt;save_interval&lt;/SPAN&gt;&lt;SPAN&gt;` to the trainer, your model checkpoints should appear under the ‘checkpoints’ folder in this tab. From there, you can copy their path to use them in a different Databricks script, or download them locally.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="amcclendon_20-1747750517613.png" style="width: 633px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16995iB7DA4D55CB3A9EE5/image-dimensions/633x361?v=v2" width="633" height="361" role="button" title="amcclendon_20-1747750517613.png" alt="amcclendon_20-1747750517613.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN&gt;System Metrics&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;Under ‘&lt;EM&gt;System metrics&lt;/EM&gt;’, AI Runtime automatically logs system metrics from the driver and worker machines, such as GPU memory usage and power usage. These metrics can help identify out-of-memory (OOM) errors. The metric “system/gpu_0_utilization_percentage” is particularly useful when tuning the batch size to maximize GPU utilization.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN&gt;Model Metrics&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;The ‘&lt;EM&gt;Model metrics&lt;/EM&gt;’ tab shows the loss that is logged automatically from the trainer’s training loop, along with other metrics such as time per batch. For the time metrics, I recommend configuring the graphs to have “time” instead of “step” on the x-axis, as the default plots step against step, which is just a straight line. If you have set &lt;/SPAN&gt;&lt;SPAN&gt;device_train_microbatch_size&lt;/SPAN&gt;&lt;SPAN&gt; to &lt;/SPAN&gt;&lt;SPAN&gt;"auto"&lt;/SPAN&gt;&lt;SPAN&gt;, you will also see a log of the microbatch size that the trainer selected.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H1&gt;&lt;SPAN&gt;Conclusion&lt;/SPAN&gt;&lt;/H1&gt;
&lt;P&gt;&lt;SPAN&gt;In this blog post, we showed how to design and execute a tailored LLM training job across a multi-GPU cluster with minimal code (&amp;lt;100 lines) and overhead. This enables your engineering team to drive impactful results that keep your business ahead of the competition, at a fraction of the effort it would normally take.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Interested in learning more? Reach out to one of our experts today!&lt;/STRONG&gt;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Who are we?&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;Aimpoint Digital is a market-leading analytics firm at the forefront of solving the most complex business and economic challenges through data and analytical technology. From integrating self-service analytics to implementing AI at scale and modernizing data infrastructure environments, Aimpoint Digital operates across transformative domains to improve the performance of organizations. Connect with our team and get started today.&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 21 May 2025 03:40:57 GMT</pubDate>
    <dc:creator>amcclendon</dc:creator>
    <dc:date>2025-05-21T03:40:57Z</dc:date>
    <item>
      <title>Pretraining Large Language Models with Databrick's AI Runtime</title>
      <link>https://community.databricks.com/t5/technical-blog/pretraining-large-language-models-with-databrick-s-ai-runtime/ba-p/119771</link>
      <description>&lt;H1&gt;&lt;SPAN&gt;Introduction&lt;/SPAN&gt;&lt;/H1&gt;
&lt;P&gt;In this blog, we'll explore how to leverage Databricks’ latest innovation—AI Runtime—to efficiently pre-train large language models (LLMs). This guide is designed for ML developers and LLM engineers who need to deploy training runs on multi-GPU clusters. While we focus on a real pre-training use case, the same principles apply to fine-tuning workloads as well.&lt;/P&gt;
&lt;P&gt;AI Runtime simplifies access to high-performance GPU clusters, such as H100s, by handling GPU orchestration under the hood. Users can interact with these resources through two main entry points: &lt;EM&gt;Notebooks&lt;/EM&gt; and &lt;EM&gt;Jobs&lt;/EM&gt;. This blog will focus on the latter, demonstrating how to execute workloads using an optimized AI image preloaded with essential libraries like PyTorch, CUDA, and Composer.&lt;/P&gt;
&lt;P&gt;At Aimpoint Digital Labs, we successfully pre-trained a 1.5B parameter model on a single 8xH100 cluster—but these concepts are just as applicable to models of varying sizes.&lt;/P&gt;
&lt;P&gt;By the end of this tutorial, you'll be able to:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&amp;nbsp;Set up your Databricks environment&lt;/LI&gt;
&lt;LI&gt;&amp;nbsp;Initialize your LLM&lt;/LI&gt;
&lt;LI&gt;&amp;nbsp;Customize Composer's Trainer with advanced settings&lt;/LI&gt;
&lt;LI&gt;&amp;nbsp;Launch and manage Jobs in AI Runtime&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Let’s dive in!&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Why should I use Composer's Trainer?&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;Composer's Trainer removes the engineering burden that many AI researchers face when training LLMs. It enables fine-grained control over training, with custom callbacks, loggers, events and more, while integrating out of the box with DeepSpeed and FSDP, and it offers handy features like automatic batch size detection to maximize GPU usage. In fact, some notable LLMs like &lt;A href="https://arxiv.org/abs/2412.13663" target="_blank" rel="noopener"&gt;ModernBERT&lt;/A&gt;,&amp;nbsp;&lt;A href="https://arxiv.org/abs/2310.06694" target="_blank" rel="noopener"&gt;Sheared LLaMA&lt;/A&gt;&amp;nbsp;and&amp;nbsp;&lt;A href="https://arxiv.org/abs/2306.15006" target="_blank" rel="noopener"&gt;DNABERT-2&lt;/A&gt; have leveraged the Composer framework. For an overview of Composer’s Trainer check out our &lt;/P&gt;
&lt;H1&gt;&lt;SPAN&gt;Setup&lt;/SPAN&gt;&lt;/H1&gt;
&lt;P&gt;As we will be processing massive volumes of data, and we don't want this to slow down our training, we will be using Mosaic's Streaming Dataset, which first requires converting the dataset to&amp;nbsp;&lt;A href="https://docs.mosaicml.com/projects/streaming/en/latest/preparing_datasets/dataset_format.html#mds" target="_blank" rel="noopener"&gt;Mosaic Data Shards (MDS)&lt;/A&gt;&amp;nbsp;(the "&lt;EM&gt;most performant file format for fast sample random-access&lt;/EM&gt;"). You can also skip this step and use a standard dataset, as we will cover both options.&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Requirements&lt;/SPAN&gt;&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;Add an external location to your Databricks workspace. For setup, follow this&amp;nbsp;&lt;A href="https://docs.databricks.com/en/connect/unity-catalog/cloud-storage/external-locations.html#:~:text=To%20assign%20an%20external%20location,MANAGE%20on%20the%20external%20location." target="_blank" rel="noopener"&gt;documentation&lt;/A&gt;.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;&lt;SPAN&gt;Connecting your data bucket&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;Here we present an option for connecting our data so that it is accessible from our training script. It is useful whether you have an MDS dataset or are using your own dataset format. It is not necessary if you plan to download the dataset for every experiment you run (e.g. using &lt;SPAN&gt;load_dataset&lt;/SPAN&gt; from the Hugging Face &lt;SPAN&gt;datasets&lt;/SPAN&gt; library to download from the Hugging Face Hub).&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;If you already have a &lt;SPAN&gt;c&lt;/SPAN&gt;atalog and schema to store your bucket in, skip to step 3.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Steps:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;From your Databricks &lt;SPAN&gt;workspace&lt;/SPAN&gt;, &lt;SPAN&gt;navigate&lt;/SPAN&gt; to the &lt;EM&gt;'Catalog'&lt;/EM&gt; page, which can be found in the left-hand menu&lt;/LI&gt;
&lt;LI&gt;Choose or create an appropriate catalog in Databricks. Catalogs are the highest level in Databricks' three-level namespace.&lt;/LI&gt;
&lt;LI&gt;Choose or create an appropriate schema, where we will add the external volume with our dataset. A schema is the second layer of &lt;SPAN&gt;the &lt;/SPAN&gt;namespace.
&lt;OL&gt;
&lt;LI&gt;To create a schema, click on a catalog, and click on the right-hand button that says 'Create Schema'&lt;/LI&gt;
&lt;LI&gt;Give a name for your schema and leave the external location blank if you want the volume to be managed by Databricks&lt;/LI&gt;
&lt;/OL&gt;
&lt;/LI&gt;
&lt;LI&gt;Click on the Schema you want to use, and you'll see a &lt;EM&gt;'Create'&lt;/EM&gt; button on the top &lt;SPAN&gt;right.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="amcclendon_14-1747750517543.png" style="width: 663px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16989i5037AECDCB41A9E4/image-dimensions/663x382?v=v2" width="663" height="382" role="button" title="amcclendon_14-1747750517543.png" alt="amcclendon_14-1747750517543.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;5.&amp;nbsp;Select the option '&lt;EM&gt;Create a Volume&lt;/EM&gt;', give your volume a name so we can use it from the training script, and choose '&lt;EM&gt;External volume&lt;/EM&gt;' as volume type, to point to the external location where your dataset &lt;SPAN&gt;is.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="amcclendon_15-1747750517554.png" style="width: 660px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16990i31035F583E3A487B/image-dimensions/660x670?v=v2" width="660" height="670" role="button" title="amcclendon_15-1747750517554.png" alt="amcclendon_15-1747750517554.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Setting up your repo or code folder&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;We will set up our code in our workspace. To access your workspace:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Go to the&amp;nbsp;‘&lt;EM&gt;Workspace’&lt;/EM&gt;&amp;nbsp;tab from the left-hand &lt;SPAN&gt;menu.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;Follow your organization's guidelines as to where to create your code. I will create mine in my&amp;nbsp;&lt;EM&gt;Home&lt;/EM&gt;&amp;nbsp;folder.&lt;/LI&gt;
&lt;LI&gt;Click on the folder where we will store the code for this tutorial, and create a Folder or Git Folder.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="amcclendon_16-1747750517573.png" style="width: 630px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16991i30274798604CE542/image-dimensions/630x240?v=v2" width="630" height="240" role="button" title="amcclendon_16-1747750517573.png" alt="amcclendon_16-1747750517573.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;If you wish to use version control, you may need to connect Databricks to your Git provider.&lt;/P&gt;
&lt;H1&gt;&lt;SPAN&gt;Using Composer's Trainer&lt;/SPAN&gt;&lt;/H1&gt;
&lt;P&gt;To leverage AI Runtime, we are going to use MosaicML's Composer Trainer.&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Creating a composer model&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;In order to use &lt;SPAN&gt;M&lt;/SPAN&gt;osaic's &lt;SPAN&gt;C&lt;/SPAN&gt;omposer, we will need to instantiate a&amp;nbsp;ComposerModel. If you&lt;SPAN&gt; are&lt;/SPAN&gt; using a Hugging Face transformer model, pass it to the&amp;nbsp;HuggingFaceModel&amp;nbsp;class as follows&lt;SPAN&gt;:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;A) Using a Hugging Face Transformer&lt;/STRONG&gt;&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;from transformers import AutoModelForCausalLM
from composer.models import HuggingFaceModel
 
# Hugging Face model; a causal-LM head is needed so the forward pass returns a loss
model = AutoModelForCausalLM.from_pretrained('mistralai/Mistral-Nemo-Base-2407')
 
# Composer model, ready to be passed to our trainer
composer_model = HuggingFaceModel(model)
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&lt;STRONG&gt;B) Using a custom model&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Our custom model can be anything, as long as it implements&amp;nbsp;&lt;SPAN&gt;loss()&lt;/SPAN&gt;&amp;nbsp;and&amp;nbsp;&lt;SPAN&gt;forward()&lt;/SPAN&gt;. The trainer deals with calling:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
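&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;x.to(device), y.to(device)  # move the batch to the device
optimizer.zero_grad()       # clear gradients from the previous step
loss.backward()             # backpropagate the loss
optimizer.step()            # update the weights&lt;/LI-CODE&gt;&lt;/TD&gt;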
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;Here&lt;SPAN&gt; is &lt;/SPAN&gt;an example implementation of a composer-compatible custom model:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;import torchvision
import torch.nn.functional as F
 
from composer.models import ComposerModel
 
class ResNet18(ComposerModel):
 
    def __init__(self):
        super().__init__()
        self.model = torchvision.models.resnet18()
 
    def forward(self, batch): # batch is the output of the dataloader
        # specify how batches are passed through the model
        inputs, _ = batch
        return self.model(inputs)
 
    def loss(self, outputs, batch):
        # pass batches and `forward` outputs to the loss
        _, targets = batch
        return F.cross_entropy(outputs, targets)
 
composer_model = ResNet18()
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;H2&gt;&lt;SPAN&gt;Optimizer and Learning Rate Scheduler&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;We initialize our optimizer and learning rate scheduler as usual. Composer supports any torch optimizer and scheduler, and it also ships its own scheduler implementations, such as &lt;/SPAN&gt;&lt;SPAN&gt;StepScheduler&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;MultiStepScheduler&lt;/SPAN&gt;&lt;SPAN&gt; and &lt;/SPAN&gt;&lt;SPAN&gt;ExponentialScheduler&lt;/SPAN&gt;&lt;SPAN&gt;. See the &lt;A href="https://docs.mosaicml.com/projects/composer/en/stable/trainer/schedulers.html" target="_blank" rel="noopener"&gt;documentation&lt;/A&gt; for the complete list. Here we are using Composer's &lt;/SPAN&gt;&lt;SPAN&gt;LinearWithWarmupScheduler&lt;/SPAN&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
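&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;import torch
from composer.optim import LinearWithWarmupScheduler
 
# use composer_model.model.parameters() instead if using a custom model as shown above
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
 
lr_scheduler = LinearWithWarmupScheduler(
    t_warmup="1ep",  # length of the warmup phase
    alpha_i=1.0,     # learning rate multiplier at the end of warmup
    alpha_f=1.0      # learning rate multiplier at the end of training
)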
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
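&lt;P&gt;To build some intuition for what this scheduler does to the learning rate, here is a minimal pure-Python sketch of a linear schedule with warmup. It reflects my reading of the documented behavior (a linear ramp from 0 to alpha_i during warmup, then a linear move from alpha_i to alpha_f) and is not Composer's actual implementation. The function name &lt;SPAN&gt;lr_multiplier&lt;/SPAN&gt; and the step counts are made up for illustration.&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;def lr_multiplier(step, warmup_steps, total_steps, alpha_i=1.0, alpha_f=1.0):
    """Illustrative learning rate multiplier for a linear schedule with warmup."""
    # Warmup phase: ramp linearly from 0 up to alpha_i.
    if step in range(warmup_steps):
        return alpha_i * step / warmup_steps
    # Afterwards: move linearly from alpha_i to alpha_f over the remaining steps.
    remaining = max(total_steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / remaining, 1.0)
    return alpha_i + progress * (alpha_f - alpha_i)

base_lr = 5e-5  # same base learning rate as the optimizer above
for step in (0, 50, 100, 1000):
    # hypothetical run: 100 warmup steps out of 1000 total steps
    print(step, base_lr * lr_multiplier(step, warmup_steps=100, total_steps=1000))
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;With alpha_i and alpha_f both set to 1.0, the learning rate stays at its base value once warmup ends. Under this reading, a warmup as long as the whole run (for example "1ep" for both the warmup and the maximum duration) would keep the learning rate ramping up until the final step, so a shorter warmup may be worth considering.&lt;/P&gt;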
&lt;H2&gt;&lt;SPAN&gt;Dataset&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;To load our dataset when working in a distributed environment, we need to initialize the process group. The Composer stack will take care of distributing the data shards appropriately. We are using the “nccl” backend, which has been designed and optimized by NVIDIA for GPU-to-GPU communication.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;&lt;SPAN&gt;This step should be skipped if only using 1 machine&lt;/SPAN&gt;&lt;/EM&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;import os
 
import torch
from torch.distributed import init_process_group
 
init_process_group(backend="nccl")
# LOCAL_RANK is the rank within this machine; RANK and WORLD_SIZE hold the global rank and size
device = f"cuda:{os.environ['LOCAL_RANK']}"
torch.cuda.set_device(device)
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;We also set &lt;SPAN&gt;torch.cuda&lt;/SPAN&gt;’s device, as we are in a distributed setting. Now we can create our dataset and dataloader. We point the &lt;SPAN&gt;StreamingDataset&lt;/SPAN&gt; to the external volume we created before, using the following syntax:&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;"dbfs:/Volumes/{YOUR_CATALOG_NAME}/{YOUR_SCHEMA_NAME}/{YOUR_EXTERNAL_VOLUME_NAME}/{PATH_INSIDE_YOUR_BUCKET}"&lt;/SPAN&gt;&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;from streaming import StreamingDataset
from torch.utils.data import DataLoader
 
train_dataset = StreamingDataset(
    remote="dbfs:/Volumes/{YOUR_CATALOG_NAME}/{YOUR_SCHEMA_NAME}/{YOUR_EXTERNAL_VOLUME_NAME}/{PATH_INSIDE_YOUR_BUCKET}",
    shuffle=False,
    batch_size=60,
)
 
# the DataLoader does no shuffling of its own; sample order is left to the StreamingDataset
train_dataloader = DataLoader(train_dataset, batch_size=60)
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
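&lt;P&gt;If you switch between datasets often, it can help to build the remote path from its parts instead of editing a long string by hand. The sketch below is one way to do it; the helper &lt;SPAN&gt;volume_path&lt;/SPAN&gt; and the catalog, schema and volume names are hypothetical placeholders, not part of any Databricks API.&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;def volume_path(catalog, schema, volume, subpath=""):
    """Build a dbfs:/Volumes/... path from its Unity Catalog components."""
    # Drop empty segments so stray or doubled slashes in subpath do not matter.
    parts = [catalog, schema, volume] + [p for p in subpath.split("/") if p]
    return "dbfs:/Volumes/" + "/".join(parts)

# hypothetical names, replace them with your own catalog, schema and volume
remote = volume_path("main", "llm_data", "pretraining_bucket", "mds/train")
print(remote)  # dbfs:/Volumes/main/llm_data/pretraining_bucket/mds/train
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;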
&lt;H3&gt;&lt;SPAN&gt;Note for fine-tuning users (the non-StreamingDataset approach)&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;If you are working with a smaller dataset for fine-tuning, setting up a &lt;SPAN&gt;StreamingDataset&lt;/SPAN&gt; might be unnecessary overhead. In this case, I recommend a simpler approach: load your dataset as usual, for instance with Hugging Face’s &lt;SPAN&gt;load_dataset&lt;/SPAN&gt; function, which supports datasets from the Hugging Face Hub or local storage. Next, tokenize your dataset and save it in its tokenized form to avoid redundant tokenization every time you start training. Finally, initialize a &lt;SPAN&gt;DataLoader&lt;/SPAN&gt;, just as we did earlier, and pass it to the trainer. This step is fully customizable as long as your &lt;SPAN&gt;DataLoader&lt;/SPAN&gt; outputs data in a format compatible with your model’s forward pass.&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Other recommended Features&lt;/SPAN&gt;&lt;/H2&gt;
&lt;H3&gt;1.&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN&gt;Loggers&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;Composer’s trainer takes a &lt;SPAN&gt;loggers&lt;/SPAN&gt; argument where we can pass a &lt;SPAN&gt;&lt;A href="https://docs.mosaicml.com/projects/composer/en/stable/trainer/logging.html" target="_blank" rel="noopener"&gt;variety of loggers,&lt;/A&gt;&lt;/SPAN&gt; including most popular experiment trackers. Here we are using the `&lt;SPAN&gt;MLFlowLogger&lt;/SPAN&gt;`, as it integrates nicely with Databricks.&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;from composer.loggers import MLFlowLogger
from datetime import datetime
 
loggers = MLFlowLogger(
    experiment_name="LLM_pretraining", 
    run_name= datetime.now().strftime("%Y-%m-%d-%H-%M"),
    model_registry_uri="databricks"
)
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;You can pass a list to loggers if you want to use several loggers. The &lt;SPAN&gt;MLFlowLogger&lt;/SPAN&gt; also logs our model checkpoints as an experiment artifact, for later use.&lt;/P&gt;
&lt;H3&gt;2.&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN&gt;Auto batch size&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;Finding the batch size that maximizes GPU usage is a tedious manual task, which normally involves iteratively adjusting the gradient accumulation steps, global batch size and microbatch size. Composer’s trainer handles this for you with the setting &lt;SPAN&gt;device_train_microbatch_size = "auto"&lt;/SPAN&gt;, which makes the trainer find the microbatch size that maximizes GPU utilization for your particular model and training run.&lt;/P&gt;
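&lt;P&gt;To make the relationship between these quantities concrete, here is a small arithmetic sketch. It assumes that the DataLoader batch size (60 in this post) is the per-GPU batch, and that each per-GPU batch is split into microbatches whose gradients are accumulated before a single optimizer step. The function names are hypothetical, and the sketch only illustrates the bookkeeping, not how the trainer searches for the microbatch size.&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;import math

def accumulation_steps(device_batch_size, microbatch_size):
    """Forward/backward passes needed per optimizer step on one GPU."""
    return math.ceil(device_batch_size / microbatch_size)

def global_batch_size(device_batch_size, num_gpus):
    """Samples consumed per optimizer step across all GPUs."""
    return device_batch_size * num_gpus

# 8 GPUs with a per-GPU batch of 60, as in this post
print("global batch size:", global_batch_size(60, num_gpus=8))
for micro in (60, 30, 16, 8):
    print("microbatch", micro, "accumulation steps", accumulation_steps(60, micro))
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;A smaller microbatch lowers peak GPU memory at the cost of more passes per optimizer step, while the global batch size stays the same, so the optimization itself should be unaffected.&lt;/P&gt;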
&lt;H3&gt;3.&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN&gt;Speeding up training (DeepSpeed, FSDP and Torch Compile)&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;We can speed up training by using popular frameworks like FSDP and torch compile. Torch compile provided a slight speed increase, but required some tinkering to get working, so I recommend FSDP for most use cases. This is an example of an FSDP configuration:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;fsdp_config = {
    'activation_checkpointing': False,
    'activation_checkpointing_reentrant': True, # Only matters if checkpointing is True, but leaving default
    'activation_cpu_offload': False,
    'backward_prefetch': 'BACKWARD_PRE',
    'forward_prefetch': True,
    'cpu_offload': False,
    #'mixed_precision': 'PURE', # More aggressive precision reduction, can improve speed if stable
    'sharding_strategy': 'SHARD_GRAD_OP', # Shard gradients and optimizer states but not parameters, may reduce overhead
    'sync_module_states': False, # Skip initial state sync for a slight performance gain
    'use_orig_params': True,
    'verbose': False,
}
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&lt;SPAN&gt;L&lt;/SPAN&gt;aunching a training job with FSDP is as simple as passing this config to the trainer.&lt;/P&gt;
&lt;P&gt;We can also specify configurations for torch compile:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;compile_config = {
    'mode': 'default',
    'dynamic': True
}&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;H3&gt;4.&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN&gt;Checkpointing&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;To enable checkpointing, we need to pass an MLFlowLogger with the option &lt;SPAN&gt;model_registry_uri="databricks"&lt;/SPAN&gt;, as well as the kwargs `&lt;SPAN&gt;save_interval&lt;/SPAN&gt;` and `&lt;SPAN&gt;save_folder&lt;/SPAN&gt;`, to the trainer. For this to work, your code must be wrapped in an `&lt;SPAN&gt;if __name__ == "__main__"&lt;/SPAN&gt;` block. I recommend setting:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;save_folder = "dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts/checkpoints"&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;The variable &lt;SPAN&gt;save_interval&lt;/SPAN&gt; should be a string with an integer followed by &lt;SPAN&gt;‘&lt;/SPAN&gt;ba&lt;SPAN&gt;’&lt;/SPAN&gt; or &lt;SPAN&gt;‘&lt;/SPAN&gt;ep&lt;SPAN&gt;’&lt;/SPAN&gt; for batches or epochs, respectively. For example, &lt;SPAN&gt;"2000ba"&lt;/SPAN&gt; corresponds to checkpointing every 2000 batches.&lt;/P&gt;
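&lt;P&gt;If the interval comes from a config file, a quick format check before launching a long job can save a failed run. Below is a minimal sketch of such a check for the two units described above. The helper &lt;SPAN&gt;parse_interval&lt;/SPAN&gt; is hypothetical and is not part of Composer, which does its own parsing and, as far as I know, accepts more units than the two covered here.&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;import re

UNITS = {"ba": "batches", "ep": "epochs"}

def parse_interval(interval):
    """Split a string such as '2000ba' into (2000, 'batches')."""
    match = re.fullmatch(r"(\d+)(ba|ep)", interval)
    if match is None:
        raise ValueError(f"expected something like '2000ba' or '1ep', got {interval!r}")
    return int(match.group(1)), UNITS[match.group(2)]

print(parse_interval("2000ba"))  # (2000, 'batches')
print(parse_interval("1ep"))     # (1, 'epochs')
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;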
&lt;P&gt;&lt;EM&gt;If you will only use the final checkpoint, I recommend setting neither `&lt;/EM&gt;&lt;SPAN&gt;save_interval&lt;/SPAN&gt;&lt;EM&gt;` nor `&lt;/EM&gt;&lt;SPAN&gt;save_folder&lt;/SPAN&gt;&lt;EM&gt;` to speed up training.&lt;/EM&gt;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Finally: the trainer&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;Gathering everything we have seen so far, creating a trainer is as simple as:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;import composer
 
trainer_args = {
    "model": composer_model,
    "train_dataloader": train_dataloader,
    "max_duration": "1ep",
    "optimizers": optimizer,
    "schedulers": lr_scheduler,
    "step_schedulers_every_batch": True,
    "device": "gpu", # Composer expects a device type such as "gpu" or "cpu" rather than a "cuda:N" string
    "loggers": loggers,
    "device_train_microbatch_size": "auto",
    "save_folder": "dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts/checkpoints",
    "save_interval": "2000ba", # Can end in ep for epochs, or ba for batches
    "parallelism_config": {"fsdp":fsdp_config},
    "compile_config": {
        'mode': 'default',
        'dynamic': True
    },
}
 
trainer = composer.trainer.Trainer(
    **trainer_args
)
 
trainer.fit() 
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;Then, `&lt;SPAN&gt;trainer.fit()`&lt;/SPAN&gt; launches the training run.&lt;/P&gt;
&lt;P&gt;The whole script is:&lt;/P&gt;
&lt;TABLE style="width: 100%; border-style: hidden;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;import os
from datetime import datetime
 
from composer.loggers import MLFlowLogger
from composer.models import HuggingFaceModel
from composer.optim import LinearWithWarmupScheduler
from composer.trainer import Trainer
from streaming import StreamingDataset
 
import torch
from torch.distributed import init_process_group
from torch.utils.data import DataLoader
 
from transformers import AutoModelForCausalLM
 
if __name__ == "__main__":
    # Hugging Face model; a causal-LM head is needed so the forward pass returns a loss.
    # The trainer moves the model to the GPU, so we do not call .to(device) here.
    model = AutoModelForCausalLM.from_pretrained('mistralai/Mistral-Nemo-Base-2407')
 
    # Composer model, ready to be passed to our trainer
    composer_model = HuggingFaceModel(model)
 
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
 
    lr_scheduler = LinearWithWarmupScheduler(
        t_warmup="1ep",
        alpha_i=1.0,
        alpha_f=1.0
    )
 
    # one process per GPU; LOCAL_RANK is this process's GPU index on the machine
    init_process_group(backend="nccl")
    torch.cuda.set_device(f"cuda:{os.environ['LOCAL_RANK']}")
 
    train_dataset = StreamingDataset(remote="YOUR_REMOTE_PATH", shuffle=False, batch_size=60)
    train_dataloader = DataLoader(train_dataset, batch_size=60)
 
    loggers = MLFlowLogger(
        experiment_name="LLM_pretraining",
        run_name=datetime.now().strftime("%Y-%m-%d-%H-%M"),
        model_registry_uri="databricks"
    )
 
    fsdp_config = {
        'activation_checkpointing': False,
        'activation_checkpointing_reentrant': True, # Only matters if checkpointing is True, but leaving default
        'activation_cpu_offload': False,
        'backward_prefetch': 'BACKWARD_PRE',
        'forward_prefetch': True,
        'cpu_offload': False,
        #'mixed_precision': 'PURE', # More aggressive precision reduction, can improve speed if stable
        'sharding_strategy': 'SHARD_GRAD_OP', # Shard gradients and optimizer states but not parameters, may reduce overhead
        'sync_module_states': False, # Skip initial state sync for a slight performance gain
        'use_orig_params': True,
        'verbose': False,
    }
 
    trainer_args = {
        "model": composer_model,
        "train_dataloader": train_dataloader,
        "max_duration": "1ep",
        "optimizers": optimizer,
        "schedulers": lr_scheduler,
        "step_schedulers_every_batch": True,
        "device": "gpu", # Composer expects a device type such as "gpu" or "cpu"
        "loggers": loggers,
        "device_train_microbatch_size": "auto",
        "save_folder": "dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts/checkpoints",
        "save_interval": "2000ba", # Can end in ep for epochs, or ba for batches
        "parallelism_config": {"fsdp": fsdp_config},
        "compile_config": {
            'mode': 'default',
            'dynamic': True
        },
    }
 
    trainer = Trainer(**trainer_args)
 
    trainer.fit()
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;H1&gt;AI Runtime&lt;/H1&gt;
&lt;H2&gt;&lt;SPAN&gt;GPU Pool&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;You will first need access to a GPU pool. You can check this in the Compute section (left-hand menu) &amp;gt; GPU Pools.&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Launching jobs&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;An AI Runtime job requires a shell script to execute.&lt;/P&gt;
&lt;H4&gt;&lt;SPAN&gt;1. Creating the Shell Script&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;First, create a simple launch.sh script anywhere in your workspace. I placed mine in the same folder as &lt;SPAN&gt;train.py&lt;/SPAN&gt;:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;#!/bin/bash

composer train.py
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;H4&gt;&lt;SPAN&gt;2. Using a YAML Configuration File (Optional but Recommended)&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;A powerful feature you can use is a configuration YAML file. This allows you to define parameters dynamically instead of hardcoding them in your script. To enable this, create a &lt;SPAN&gt;config.yaml&lt;/SPAN&gt; file—this can be empty for now. A sample configuration might look like this:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;batch_size: 60
lr: 1e-5
model: "mistralai/Mistral-Nemo-Base-2407"
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;The location of this YAML file will be stored in the &lt;SPAN&gt;PARAMETERS&lt;/SPAN&gt; environment variable. You can then load the configuration in your training script like this:&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
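&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;import os
import yaml
 
# PARAMETERS holds the path to the YAML configuration file
with open(os.environ["PARAMETERS"]) as f:
    config = yaml.safe_load(f) or {}  # an empty file loads as None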
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;Now, instead of hardcoding values, you can reference them dynamically. For example, replace:&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;batch_size = 60&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;with:&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;batch_size = config.get("batch_size", 60)&amp;nbsp; # Defaults to 60 if not found&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;This approach makes your training workflow more flexible and easier to adjust without touching the code.&lt;/P&gt;
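&lt;P&gt;As the number of parameters grows, it can be convenient to collect the defaults and type conversions in one place instead of scattering &lt;SPAN&gt;config.get&lt;/SPAN&gt; calls through the script. The sketch below shows one possible way to do that; &lt;SPAN&gt;TrainConfig&lt;/SPAN&gt; and &lt;SPAN&gt;build_config&lt;/SPAN&gt; are hypothetical names, and the input dictionary stands in for whatever mapping was loaded from the YAML file. The explicit float conversion is deliberate: to my knowledge, PyYAML reads a value written as 1e-5 (without a decimal point) as a string.&lt;/P&gt;
&lt;TABLE style="border-style: hidden; width: 100%;" border="1" width="100%"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="100%"&gt;&lt;LI-CODE lang="markup"&gt;from dataclasses import dataclass

@dataclass
class TrainConfig:
    # defaults mirror the sample configuration shown in this section
    batch_size: int = 60
    lr: float = 1e-5
    model: str = "mistralai/Mistral-Nemo-Base-2407"

def build_config(raw):
    """Merge a loaded mapping (possibly empty or None) with typed defaults."""
    raw = raw or {}
    defaults = TrainConfig()
    return TrainConfig(
        batch_size=int(raw.get("batch_size", defaults.batch_size)),
        lr=float(raw.get("lr", defaults.lr)),  # also handles "1e-5" given as a string
        model=str(raw.get("model", defaults.model)),
    )

config = build_config({"lr": "1e-5", "batch_size": 32})
print(config)
&lt;/LI-CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;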
&lt;H4&gt;&lt;SPAN&gt;3. Creating a workflow&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;Now, let’s create our LLM training job. Go to the ‘&lt;EM&gt;Workflows&lt;/EM&gt;’ section in the left-hand menu and create a job.&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="amcclendon_17-1747750517578.png" style="width: 645px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16994i5430AA478BAF1140/image-dimensions/645x345?v=v2" width="645" height="345" role="button" title="amcclendon_17-1747750517578.png" alt="amcclendon_17-1747750517578.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Fill in your relevant fields:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="amcclendon_18-1747750517586.png" style="width: 626px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16992iD5B490954247207F/image-dimensions/626x335?v=v2" width="626" height="335" role="button" title="amcclendon_18-1747750517586.png" alt="amcclendon_18-1747750517586.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Once you have created the job, the job list overview for your task might look like this:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="amcclendon_19-1747750517596.png" style="width: 644px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16993i5CC00E7B10B7B499/image-dimensions/644x359?v=v2" width="644" height="359" role="button" title="amcclendon_19-1747750517596.png" alt="amcclendon_19-1747750517596.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;From this screen, click on ‘&lt;EM&gt;Run now&lt;/EM&gt;’ to launch your training job. This queues the job and executes it as soon as your compute is ready. If it doesn’t launch immediately, your compute may be busy with another task. To see what is running in your GPU pool, go to “Compute &amp;gt; GPU Pools” and find your compute.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;We’ll now go over some handy features that AI Runtime offers to monitor your training job.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;AI Runtime logs and metrics&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;When we create a job in AI Runtime, an MLflow experiment is automatically created for us. It holds information about the training run, including model metrics, system metrics, and artifacts such as model checkpoints.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;There are two ways we can find our experiment:&lt;/SPAN&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN&gt;Recommended&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt;: Workflows (tab in the left-hand menu) &amp;gt; click on the task we just created &amp;gt; click on your running job (easy to find, as runs are sorted chronologically and have a ‘&lt;EM&gt;status&lt;/EM&gt;’ field) &amp;gt; in the ‘&lt;EM&gt;Training Output&lt;/EM&gt;’ table, click on either ‘&lt;EM&gt;MLflow Run&lt;/EM&gt;’ or ‘&lt;EM&gt;Detailed Logs&lt;/EM&gt;’.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Experiments (tab in the left-hand menu) &amp;gt; click on your project (it will be named something like &lt;/SPAN&gt;&lt;SPAN&gt;AiTrainingTask-YOUR_TASK_NAME&lt;/SPAN&gt;&lt;SPAN&gt;).&lt;/SPAN&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN&gt;Your run may be harder to find this way, as these experiments carry little metadata.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;SPAN&gt;Once you are in the MLflow run, there are several useful views for inspecting it.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN&gt;Artifacts&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;Under the ‘&lt;EM&gt;Artifacts&lt;/EM&gt;’ tab, you’ll find a ‘&lt;EM&gt;logs&lt;/EM&gt;’ folder that contains log files with the stdout and stderr from both the training script and all the GPU workers. In the files named “&lt;/SPAN&gt;&lt;SPAN&gt;logs-n.chunk.txt&lt;/SPAN&gt;&lt;SPAN&gt;” you should see all the print statements from your Python script, as well as the traceback if an exception stops execution.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;If you have passed &lt;EM&gt;save_folder&lt;/EM&gt; and &lt;EM&gt;save_interval&lt;/EM&gt; to the trainer, your model checkpoints should appear under the checkpoints folder in this tab. From there, you can copy their path to use them in a different Databricks script or download them locally.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="amcclendon_20-1747750517613.png" style="width: 633px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16995iB7DA4D55CB3A9EE5/image-dimensions/633x361?v=v2" width="633" height="361" role="button" title="amcclendon_20-1747750517613.png" alt="amcclendon_20-1747750517613.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN&gt;System Metrics&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;Under ‘&lt;EM&gt;System metrics&lt;/EM&gt;’, AI Runtime automatically logs system metrics from the driver and worker machines, such as GPU memory usage and power usage. These metrics can help identify out-of-memory (OOM) errors. The metric “system/gpu_0_utilization_percentage” is especially useful when tuning the batch size to maximize GPU utilization.&lt;/SPAN&gt;&lt;/P&gt;
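&lt;P&gt;The same metric can also be read programmatically, which helps when comparing batch sizes across several runs. The sketch below assumes the &lt;EM&gt;mlflow&lt;/EM&gt; package is installed and that you copy the run ID from the MLflow run page; the helper names &lt;EM&gt;fetch_metric_values&lt;/EM&gt; and &lt;EM&gt;summarize_utilization&lt;/EM&gt; are hypothetical.&lt;/P&gt;

```python
from statistics import mean

# Metric key as it appears in the MLflow 'System metrics' tab.
GPU_UTIL_KEY = "system/gpu_0_utilization_percentage"


def fetch_metric_values(run_id, key=GPU_UTIL_KEY):
    """Return the logged values of one metric for an MLflow run."""
    # mlflow is an assumption; it is imported here so the summary helper
    # below can be used without it.
    from mlflow.tracking import MlflowClient

    history = MlflowClient().get_metric_history(run_id, key)
    return [point.value for point in history]


def summarize_utilization(values, warmup=0):
    """Mean and minimum utilization, skipping the first `warmup` samples."""
    steady = values[warmup:]
    if not steady:
        return None  # nothing was logged after the warmup window
    return {"mean": mean(steady), "min": min(steady), "samples": len(steady)}
```

&lt;P&gt;For example, &lt;EM&gt;summarize_utilization(fetch_metric_values("YOUR_RUN_ID"), warmup=5)&lt;/EM&gt; skips the first five samples, which tend to be logged while the model is still loading. A mean well below 100 suggests there may be room to increase the batch size.&lt;/P&gt;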
&lt;H3&gt;&lt;SPAN&gt;Model Metrics&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;The ‘&lt;EM&gt;Model metrics&lt;/EM&gt;’ tab automatically logs the loss from the trainer’s training loop, along with other metrics such as time per batch. For the time metrics, we recommend configuring the graphs to show “time” instead of “step” on the x-axis, as the default plots step against step, which is just a straight line. If you have set &lt;/SPAN&gt;&lt;SPAN&gt;device_train_microbatch_size&lt;/SPAN&gt;&lt;SPAN&gt; to &lt;/SPAN&gt;&lt;SPAN&gt;"auto"&lt;/SPAN&gt;&lt;SPAN&gt;, you will also see a log of the microbatch size that was selected.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H1&gt;&lt;SPAN&gt;Conclusion&lt;/SPAN&gt;&lt;/H1&gt;
&lt;P&gt;&lt;SPAN&gt;In this blog post, we showed how to design and execute a tailored LLM training job across a multi-GPU cluster with minimal code (&amp;lt;100 lines) and overhead. This enables your engineering team to drive impactful results that keep your business ahead of the competition, at a fraction of the effort it would normally take.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Interested in learning more? Reach out to one of our experts today!&lt;/STRONG&gt;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Who are we?&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;Aimpoint Digital is a market-leading analytics firm at the forefront of solving the most complex business and economic challenges through data and analytical technology. From integrating self-service analytics to implementing AI at scale and modernizing data infrastructure environments, Aimpoint Digital operates across transformative domains to improve the performance of organizations. Connect with our team and get started today.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 21 May 2025 03:40:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/technical-blog/pretraining-large-language-models-with-databrick-s-ai-runtime/ba-p/119771</guid>
      <dc:creator>amcclendon</dc:creator>
      <dc:date>2025-05-21T03:40:57Z</dc:date>
    </item>
  </channel>
</rss>

