Databricks Community

amcclendon · ‎05-20-2025

Introduction

In this blog, we'll explore how to leverage Databricks’ latest innovation—AI Runtime—to efficiently pre-train large language models (LLMs). This guide is designed for ML developers and LLM engineers who need to deploy training runs on multi-GPU clusters. While we focus on a real pre-training use case, the same principles apply to fine-tuning workloads as well.

AI Runtime simplifies access to high-performance GPU clusters, such as H100s, by handling GPU orchestration under the hood. Users can interact with these resources through two main entry points: Notebooks and Jobs. This blog will focus on the latter, demonstrating how to execute workloads using an optimized AI image preloaded with essential libraries like PyTorch, CUDA, and Composer.

At Aimpoint Digital Labs, we successfully pre-trained a 1.5B parameter model on a single 8xH100 cluster—but these concepts are just as applicable to models of varying sizes.

By the end of this tutorial, you'll be able to:

Set up your Databricks environment
Initialize your LLM
Customize Composer's Trainer with advanced settings
Launch and manage Jobs in AI Runtime

Let’s dive in!

Why should I use Composer's Trainer?

Composer's trainer removes the engineering burden that many AI Researchers face when training LLMs. It enables very high grained control over training, with custom callbacks, loggers, events and more, while integrating out of the box with DeepSpeed and FSDP, as well as handy features like automatic batch size detection to maximize GPU usage. In fact, some notable LLMs like ModernBERT, SHEARED LLAMA and DNABERT-2 have leveraged the composer framework. For an overview of Composer’s Trainer check out our

Setup

As we will be processing massive volumes of data, and we don't want this to slow down our training, we will be using Mosaic's Streaming Dataset, which first requires converting the dataset to Mosaic Data Shards (MDS) (the "most performant file format for fast sample random-access"). Check out You can skip this step and use a standard dataset as well, as we will cover both options.

Requirements

Add an external location to your Databricks workspace. For setup follow this documentation.

Connecting your data bucket

Here we present an option for connecting our data so that it is accessible by our training script. It is useful both if you have an MDS dataset, or if you're using your own dataset. It is not necessary if you plan to download the dataset for every experiment you run (i.e. using Transformer's dataset.load_dataset to download from Hugging Face).

If you already have a catalog and schema to store your bucket in, skip to step 3.

Steps:

From your Databricks workspace, navigate to the 'Catalog' page, which can be found in the left-hand menu
Choose or create an appropriate catalog in Databricks. Catalogs are the highest-level in Databricks' three-level namespace.
Choose or create an appropriate schema, where we will add the external volume with our dataset. A schema is the second layer of the namespace.
1. To create a schema, click on a catalog, and click on the right-hand button that says 'Create Schema'
2. Give a name for your schema and leave the external location blank if you want the volume to be managed by Databricks
Click on the Schema you want to use, and you'll see a 'Create' button on the top right.

5. Select the option 'Create a Volume', give your volume a name so we can use it from the training script, and choose 'External volume' as volume type, to point to the external location where your dataset is.

Setting up your repo or code folder

We will set up our code in our workspace. To access your workspace:

Go to the ‘Workspace’ tab from the left-hand menu.
Follow your organization guideline's as to where to create your code. I will create mine in my Home folder
Click on the folder where we will store the code for this tutorial, and create a Folder or Git Folder.

You may need to connect Databricks to your git provider if you wish to link them, for version control.

Using Composer's Trainer

To leverage the AI Runtime, we are going to use MosaicML Composer Trainer.

Creating a composer model

In order to use Mosaic's Composer, we will need to instantiate a ComposerModel. If you are using a Hugging Face transformer model, pass it to the HuggingFaceModel class as follows:

A) Using a Hugging Face Transformer

from transformers import AutoModel
from composer.models import HuggingFaceModel
 
# huggingface model
model = AutoModel.from_pretrained('mistralai/Mistral-Nemo-Base-2407')
 
# composer model, ready to be passed to our trainer
composer_model = HuggingFaceModel(model)

B) Using a custom model

Our custom model can be anything, as long as it implements loss() and forward(). The trainer deals with calling:

x.to(device), y.to(device)
loss.backward()
optimizer.zero_grad()
optimizer.step()

Here is an example implementation of a composer-compatible custom model:

import torchvision
import torch.nn.functional as F
 
from composer.models import ComposerModel
 
class ResNet18(ComposerModel):
 
    def __init__(self):
        super().__init__()
        self.model = torchvision.models.resnet18()
 
    def forward(self, batch): # batch is the output of the dataloader
        # specify how batches are passed through the model
        inputs, _ = batch
        return self.model(inputs)
 
    def loss(self, outputs, batch):
        # pass batches and `forward` outputs to the loss
        _, targets = batch
        return F.cross_entropy(outputs, targets)
 
composer_model = ResNet18()

Optimizer and Learning Rate Scheduler

We initialize our optimizer and learning rate scheduler as usual. Composer supports any torch optimizer and scheduler. We can also use their own scheduler implementation. They have a lot of options for schedulers, such as: StepScheduler, MultiStepScheduler and ExponentialScheduler. See their documentation for the complete list. Here we are using their LinearWithWarmupScheduler.

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5) # or composer_model.model.parameters(), if using a custom model as shown above
 
lr_scheduler = composer.optim.LinearWithWarmupScheduler(
    t_warmup="1ep", 
    alpha_i=1.0,
    alpha_f=1.0
)

Dataset

To load our when working in a distributed environment, we need to initialize the process group. The Composer stack will take care of distributing the data shards appropriately. We are using the “nccl” backend which has been designed and optimized by Nvidia for GPU-to-GPU communication.

This step should be skipped if only using 1 machine.

from torch.distributed import init_process_group
 
init_process_group(backend="nccl")
device = f"cuda:{os.environ["LOCAL_RANK"]}" # Vars RANK, and WORLD_SIZE correspond to global rank and size
torch.cuda.set_device(device)

We also set torch.cuda’s device, as we are in a distributed setting. Now we can start our dataset and dataloaders. We point the StreamingDataset to the external location we mounted before, using the following syntax:

"dbfs:/Volumes/{YOUR_CATALOG_NAME}/{YOUR_SCHEMA_NAME}/{YOUR_EXTERNAL_VOLUME_NAME}/{PATH_INSIDE_YOUR_BUCKET}"

from streaming import StreamingDataset
from torch.utils.data import DataLoader
 
train_dataset = StreamingDataset(remote= "dbfs:/Volumes/{YOUR_CATALOG_NAME}/{YOUR_SCHEMA_NAME}/{YOUR_EXTERNAL_VOLUME_NAME}/{PATH_INSIDE_YOUR_BUCKET}", shuffle=None, batch_size=60,)
 
train_dataloader = DataLoader(train_dataset, batch_size=60, shuffle=None)

Note for Finetuning users (the Non-StreamingDataset approach)

If you are working with a smaller dataset for fine-tuning, setting up a StreamingDataset might be an unnecessary overhead. In this case, I recommend a simpler approach: load your dataset as usual—perhaps using Hugging Face’s load_dataset function, which supports datasets from the Hugging Face Hub or local storage. Next, tokenize your dataset and save it in its tokenized form to avoid redundant tokenization every time you start training. Finally, initialize a DataLoader, just as we did earlier, and pass it to the trainer. This step is fully customizable as long as your DataLoader outputs data in a format compatible with your model’s forward pass.

Other recommended Features

1. Loggers

Composer’s trainer takes a loggers argument where we can pass a variety of loggers, including most popular experiment trackers. Here we are using the `MLFlowLogger`, as it integrates nicely with Databricks.

from composer.loggers import MLFlowLogger
from datetime import datetime
 
loggers = MLFlowLogger(
    experiment_name="LLM_pretraining", 
    run_name= datetime.now().strftime("%Y-%m-%d-%H-%M"),
    model_registry_uri="databricks"
)

You can set loggers to be a list if you want to use various loggers. The MLFlowLogger also logs our model checkpoints as an experiment artifact, for later use.

2. Auto batch size

Finding the best batch size that maximizes GPU usage is a tedious manual task, which normally involves iteratively adjusting gradient accumulation steps, global batch and micro batch. Composer’s trainer handles this for you, with the configuration ‘device_train_microbatch_size = "auto"’. This setting makes the trainer find the micro batch size that maximizes the GPU utilization for your particular model and training run.

3. Speeding up training (DeepSpeed, FSDP and Torch Compile)

We can speed up training by using popular frameworks like FSDP and torch compile. It provided a slight speed increase, but required some tinkering around to get working, so I recommend FSDP for most use cases. This is an example of an FSDP configuration:

fsdp_config = {
    'activation_checkpointing': False,
    'activation_checkpointing_reentrant': True, # Only matters if checkpointing is True, but leaving default
    'activation_cpu_offload': False,
    'backward_prefetch': 'BACKWARD_PRE',
    'forward_prefetch': True,
    'cpu_offload': False,
    #'mixed_precision': 'PURE', # More aggressive precision reduction, can improve speed if stable
    'sharding_strategy': 'SHARD_GRAD_OP', # Shard gradients only, may reduce overhead
    'sync_module_states': False, # Skip initial state sync for a slight performance gain
    'use_orig_params': True,
    'verbose': False,
}

Launching a training job with FSDP is as simple as passing this config to the trainer.

We can also specify configurations for torch compile:

compile_config = {

    'mode': 'default',

    'dynamic': True

}

4. Checkpointing

To enable checkpointing, we need to pass an MLFlowLogger with the option model_registry_uri="databricks", as well as passing kwargs `save_interval` and `save_folder` to the trainer. For this to work, your code must be wrapped in an `if __name__ == "__main__"` block. I recommend setting:

save_folder = "dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts/checkpoints"

The variable save_interval should be a string with an integer followed by ‘ba’ or ‘ep’ for batches or epochs, respectively. For example, "2000ba" corresponds to checkpointing every 2000 batches.

If you will only use the final checkpoint, I recommend not setting `save_interval` nor `save_folder` to speed up training.

Finally: the trainer

Gathering everything we have seen so far, creating a trainer is as simple as:

import composer
 
trainer_args = {
    "model": composer_model,
    "train_dataloader": train_dataloader,
    "max_duration": "1ep",
    "optimizers": optimizer,
    "schedulers": lr_scheduler,
    "step_schedulers_every_batch": True,
    "device": device,
    "loggers": loggers,
    "device_train_microbatch_size": "auto",
    "save_folder": "dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts/checkpoints",
    "save_interval": "2000ba", # Can end in ep for epochs, or ba for batches
    "parallelism_config": {"fsdp":fsdp_config},
    "compile_config": {
        'mode': 'default',
        'dynamic': True
    },
}
 
trainer = composer.trainer.Trainer(
    **trainer_args
)
 
trainer.fit()

Then, `trainer.fit()` launches the training run.

The whole script is:

import os
import yaml
 
from composer.loggers import MLFlowLogger
from composer.models import HuggingFaceModel
from streaming import StreamingDataset
 
import torch
from torch.distributed import init_process_group
from torch.utils.data import Dataset, DataLoader
 
from transformers import AutoModel
 
if __name__ == "__main__":
    # huggingface model
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = AutoModel.from_pretrained('mistralai/Mistral-Nemo-Base-2407').to(device)
 
    # composer model, ready to be passed to our trainer
    composer_model = HuggingFaceModel(model)
 
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
 
    lr_scheduler = composer.optim.LinearWithWarmupScheduler(
        t_warmup="1ep", 
        alpha_i=1.0,
        alpha_f=1.0
    )
 
    init_process_group(backend="nccl")
    device = f"cuda:{os.environ["LOCAL_RANK"]}"
    torch.cuda.set_device(device)
 
    train_dataset = StreamingDataset(remote="YOUR_REMOTE_PATH", shuffle=None, batch_size=60,)
    train_dataloader = DataLoader(train_dataset, batch_size=60, shuffle=None)
 
    loggers = MLFlowLogger(
        experiment_name="LLM_pretraining", 
        run_name=run_name,
        model_registry_uri="databricks"
    )
 
    fsdp_config = {
        'activation_checkpointing': False,
        'activation_checkpointing_reentrant': True, # Only matters if checkpointing is True, but leaving default
        'activation_cpu_offload': False,
        'backward_prefetch': 'BACKWARD_PRE', 
        'forward_prefetch': True,
        'cpu_offload': False,
        #'mixed_precision': 'PURE', # More aggressive precision reduction, can improve speed if stable
        'sharding_strategy': 'SHARD_GRAD_OP', # Shard gradients only, may reduce overhead
        'sync_module_states': False, # Skip initial state sync for a slight performance gain
        'use_orig_params': True,
        'verbose': False,
    }
 
    trainer_args = { 
        "model": composer_model, 
        "train_dataloader": train_dataloader, 
        "max_duration": "1ep", 
        "optimizers": optimizer, 
        "schedulers": lr_scheduler, 
        "step_schedulers_every_batch": True, 
        "device": device, 
        "loggers": loggers, 
        "device_train_microbatch_size": "auto", 
        "save_folder": "dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts/checkpoints", 
        "save_interval": f"2000ba", # Can end in ep for epochs, or ba for batches 
        "parallelism_config": {"fsdp":fsdp_config}, 
        "compile_config": { 
            'mode': 'default', 
            'dynamic': True 
        }, 
    }
 
        trainer = composer.trainer.Trainer(
            **trainer_args
        )
    
    trainer.fit()

AI Runtime

GPU Pool

You will first need access to a GPU pool. Y13tion (left hand menu) > GPU Pools.

Launching jobs

An AI Runtime job requires a shell script to execute.

1. Creating the Shell Script

First, create a simple launch.sh script anywhere in your workspace. I placed mine in the same folder as train.py:

#!/bin/bash

composer train.py

2. Using a YAML Configuration File (Optional but Recommended)

A powerful feature you can use is a configuration YAML file. This allows you to define parameters dynamically instead of hardcoding them in your script. To enable this, create a config.yaml file—this can be empty for now. A sample configuration might look like this:

batch_size: 60
lr: 1e-5
model: "mistralai/Mistral-Nemo-Base-2407"

The location of this YAML file will be stored in the PARAMETERS environment variable. You can then load the configuration in your training script like this:

batch_size: 60
lr: 1e-5
model: "mistralai/Mistral-Nemo-Base-2407"

Now, instead of hardcoding values, you can reference them dynamically. For example, replace:

batch_size = 60

with:

batch_size = config.get("batch_size", 60) # Defaults to 60 if not found

This approach makes your training workflow more flexible and easier to adjust without touching the code.

3. Creating a workflow

Now, let’s create our LLM training job. Go to the ‘Workflows’ section on the left hand menu and create a job.

Fill in your relevant fields:

Once you have created a job, the job list overview for your task might look like:

From this screen, click on ‘Run now’ to launch your training job. This will queue your job and execute it as soon as your compute is ready. If you notice it doesn’t launch immediately, it could be that your compute is being used for another task. You can see what is being executed in your GPU pool if you head to “Compute > GPU Pools” and find your compute.

We’ll now go over some handy features that AI Runtime offers to monitor your training job.

AI Runtime logs and metrics

When we create a job in AI Runtime, an MLflow experiment is automatically created for us, that will have information about the training run including model metrics, system metrics, and artifacts like model checkpoints.

There are two ways we can find our experiment:

Recommended: Workflows (tab in left hand menu) > Click on the task we just created > Click on your running job (easy to find, as they are sorted chronologically, and have a ‘status’ field) > In the ‘Training Output’ table, click on either ‘MLflow Run’ or ‘Detailed Logs’
Experiments (tab in left hand menu) > Click on your project (will be named something like: AiTrainingTask-YOUR_TASK_NAME)
- It might be hard to find your run, as these experiments don’t have much metadata

Once you are in the MLflow run, there’s a ton of interesting features to read about your run.

Artifacts

Under the ‘Artifacts’ tab, you’ll find a ‘logs’ folder that contains log files with the STDOut and STDErr from both the training script and all the GPU workers. In the files named “logs-n.chunk.txt” you should see all the print statements from your python script, as well as the tracebacks should an exception stop the execution.

If you have passed a `save_folder` and `save_interval` to the trainer, your model checkpoints should show under the checkpoints folder in this tab. From there, you can copy the path if you wish to use them in a different Databricks script, or you just want to download them locally.

System Metrics

Under ‘System metrics’, AI Runtime automatically logs system metrics from the driver and worker machines, such as GPU Memory usage and power usage. These metrics can be useful to identify out of memory (OOM) errors. The metric “system/gpu_0_utilization_percentage” is incredibly useful when tuning the batch size to maximize GPU utilization.

Model Metrics

The ‘Model metrics’ tab automatically logs loss from the trainer’s training loop, and other metrics such as time per batch. For the time metrics, I recommend configuring the graphs to have “time” instead of “step” on the x-axis, as the default shows “step x step” which is just a straight line. If you have set device_train_microbatch_size to "auto", you will also see a log of what the optimal batch size was found to be.

Conclusion

In this blog post, we showed how to design and execute a tailored LLM training job across a huge GPU cluster with minimal code (<100 lines) and overhead. This enables your engineering team to drive impactful results that keep your business ahead of the competition, at a fraction of the effort it would normally take.

Interested in learning more? Reach out to one of our experts today!

Who are we?

Aimpoint Digital is a market-leading analytics firm at the forefront of solving the most complex business and economic challenges through data and analytical technology. From integrating self-service analytics to implementing AI at scale and modernizing data infrastructure environments, Aimpoint Digital operates across transformative domains to improve the performance of organizations. Connect with our team and get started today.