In this blog, we'll explore how to leverage Databricks’ latest innovation—AI Runtime—to efficiently pre-train large language models (LLMs). This guide is designed for ML developers and LLM engineers who need to deploy training runs on multi-GPU clusters. While we focus on a real pre-training use case, the same principles apply to fine-tuning workloads as well.
AI Runtime simplifies access to high-performance GPU clusters, such as H100s, by handling GPU orchestration under the hood. Users can interact with these resources through two main entry points: Notebooks and Jobs. This blog will focus on the latter, demonstrating how to execute workloads using an optimized AI image preloaded with essential libraries like PyTorch, CUDA, and Composer.
At Aimpoint Digital Labs, we successfully pre-trained a 1.5B parameter model on a single 8xH100 cluster—but these concepts are just as applicable to models of varying sizes.
By the end of this tutorial, you'll be able to connect your training data to Databricks, write a Composer training script, launch it as a job on a multi-GPU AI Runtime cluster, and monitor the run with MLflow.
Let’s dive in!
Composer's Trainer removes much of the engineering burden that AI researchers face when training LLMs. It enables fine-grained control over training, with custom callbacks, loggers, events, and more, while integrating out of the box with DeepSpeed and FSDP and offering handy features like automatic microbatch size detection to maximize GPU usage. Notable LLMs such as ModernBERT, Sheared LLaMA, and DNABERT-2 have been trained with the Composer framework. For an overview of Composer's Trainer, check out our earlier introduction to the library.
Since we will be processing massive volumes of data and don't want data loading to slow down training, we will use Mosaic's StreamingDataset, which first requires converting the dataset to Mosaic Data Shards (MDS), the "most performant file format for fast sample random-access". You can also skip this step and use a standard dataset; we will cover both options.
Here we present an option for connecting our data so that it is accessible from our training script. It is useful whether you already have an MDS dataset or are bringing your own dataset format. It is not necessary if you plan to download the dataset for every experiment you run (e.g., using Hugging Face's datasets.load_dataset to pull it from the Hub).
If you already have a catalog and schema to store your bucket in, skip to step 3.
Steps:
5. Select 'Create a Volume', give your volume a name (we will reference it from the training script), and choose 'External volume' as the volume type so that it points to the external location where your dataset lives.
We will set up our code in our workspace. To access your workspace:
If you wish to link your repo for version control, you may need to connect Databricks to your Git provider.
To leverage the AI Runtime, we are going to use MosaicML Composer Trainer.
In order to use Mosaic's Composer, we will need to instantiate a ComposerModel. If you are using a Hugging Face transformer model, pass it to the HuggingFaceModel class as follows:
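A minimal sketch, with `gpt2` standing in for whatever architecture or checkpoint you are actually pre-training:

```python
from composer.models import HuggingFaceModel
from transformers import AutoModelForCausalLM, AutoTokenizer

hf_model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Wrap the Hugging Face model so Composer's Trainer knows how to call it
composer_model = HuggingFaceModel(hf_model, tokenizer=tokenizer, use_logits=True)
```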
Our custom model can be anything, as long as it implements loss() and forward(). The trainer deals with calling:
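Roughly, in simplified pseudocode (a sketch of the training loop, not Composer's actual source):

```python
# What the trainer does with your ComposerModel on each step (simplified):
outputs = model.forward(batch)      # forward() receives the batch straight from your dataloader
loss = model.loss(outputs, batch)   # loss() receives forward()'s outputs plus the batch
loss.backward()
```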
Here is an example implementation of a composer-compatible custom model:
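The sketch below is one way to write such a model; the `backbone` module and the batch format are assumptions you would adapt to your own architecture:

```python
import torch
import torch.nn.functional as F
from composer.models import ComposerModel


class MyCausalLM(ComposerModel):
    """A hypothetical Composer-compatible wrapper around a plain PyTorch module."""

    def __init__(self, backbone: torch.nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, batch):
        # `batch` is whatever your dataloader yields; here we assume a dict of tensors
        return self.backbone(batch["input_ids"])  # returns logits of shape [batch, seq, vocab]

    def loss(self, outputs, batch):
        # Standard next-token prediction loss: shift logits and labels by one position
        logits = outputs[:, :-1, :]
        labels = batch["input_ids"][:, 1:]
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
        )
```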
We initialize our optimizer and learning rate scheduler as usual. Composer supports any torch optimizer and scheduler, and also provides its own scheduler implementations, such as StepScheduler, MultiStepScheduler, and ExponentialScheduler; see the documentation for the complete list. Here we use their LinearWithWarmupScheduler.
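For example (the learning rate, weight decay, and warmup length are placeholders):

```python
import torch
from composer.optim import LinearWithWarmupScheduler

optimizer = torch.optim.AdamW(composer_model.parameters(), lr=3e-4, weight_decay=0.01)

# Linear warmup for the first 1000 batches, then linear decay for the rest of training
scheduler = LinearWithWarmupScheduler(t_warmup="1000ba", alpha_f=0.1)
```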
When working in a distributed environment, we need to initialize the process group before loading our data; the Composer stack will then take care of distributing the data shards appropriately. We use the "nccl" backend, which was designed and optimized by NVIDIA for GPU-to-GPU communication.
Skip this step if you are only using one machine.
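A sketch of that setup (the LOCAL_RANK environment variable is assumed to be set by the launcher):

```python
import os
import torch
import torch.distributed as torch_dist

# NCCL is NVIDIA's backend for fast GPU-to-GPU communication
if not torch_dist.is_initialized():
    torch_dist.init_process_group(backend="nccl")

# Pin each worker process to its own GPU
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
```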
We also set torch.cuda's device, since we are in a distributed setting. Now we can set up our dataset and dataloaders. We point the StreamingDataset to the external location we mounted earlier, using the following syntax:
"dbfs:/Volumes/{YOUR_CATALOG_NAME}/{YOUR_SCHEMA_NAME}/{YOUR_EXTERNAL_VOLUME_NAME}/{PATH_INSIDE_YOUR_BUCKET}"
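A sketch with placeholder catalog, schema, and volume names:

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Placeholder Unity Catalog path -- substitute your own names
remote_path = "dbfs:/Volumes/my_catalog/my_schema/my_external_volume/pretraining_data"

train_dataset = StreamingDataset(
    remote=remote_path,       # where the MDS shards live
    local="/tmp/mds-cache",   # local cache for downloaded shards
    shuffle=True,
    batch_size=8,             # per-device batch size, used to partition shards sensibly
)

# Depending on what your shards contain, you may also need a custom collate_fn here
train_dataloader = DataLoader(train_dataset, batch_size=8, num_workers=8, pin_memory=True)
```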
If you are working with a smaller dataset for fine-tuning, setting up a StreamingDataset might be an unnecessary overhead. In this case, I recommend a simpler approach: load your dataset as usual—perhaps using Hugging Face’s load_dataset function, which supports datasets from the Hugging Face Hub or local storage. Next, tokenize your dataset and save it in its tokenized form to avoid redundant tokenization every time you start training. Finally, initialize a DataLoader, just as we did earlier, and pass it to the trainer. This step is fully customizable as long as your DataLoader outputs data in a format compatible with your model’s forward pass.
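A sketch of that simpler path (the dataset name, tokenizer, and sequence length are placeholders):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, default_data_collator

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")  # placeholder dataset
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(examples):
    # Pad to a fixed length so the default collator can stack samples into tensors
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=1024)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
tokenized.save_to_disk("/tmp/tokenized-dataset")  # reuse on later runs instead of re-tokenizing

train_dataloader = DataLoader(
    tokenized.with_format("torch"), batch_size=8, collate_fn=default_data_collator
)
```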
Composer’s trainer takes a loggers argument where we can pass a variety of loggers, including most popular experiment trackers. Here we are using the `MLFlowLogger`, as it integrates nicely with Databricks.
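For example (the experiment path is a placeholder; point it at a workspace path you own):

```python
from composer.loggers import MLFlowLogger

mlflow_logger = MLFlowLogger(
    experiment_name="/Users/you@example.com/llm-pretraining",  # hypothetical experiment path
    tracking_uri="databricks",
)
```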
You can pass a list as `loggers` if you want to use several loggers at once. The MLFlowLogger also logs our model checkpoints as experiment artifacts for later use.
Finding the batch size that maximizes GPU usage is a tedious manual task, which normally involves iteratively adjusting the gradient accumulation steps, global batch size, and microbatch size. Composer's trainer handles this for you via the setting `device_train_microbatch_size="auto"`, which makes it find the microbatch size that maximizes GPU utilization for your particular model and training run.
We can speed up training with popular techniques like FSDP and torch.compile. In our runs, torch.compile provided a slight speed increase but required some tinkering to get working, so I recommend FSDP for most use cases. This is an example of an FSDP configuration:
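A sketch of such a config dictionary (the exact values are assumptions to tune for your model size and memory budget):

```python
# Hypothetical FSDP settings -- adjust sharding and precision for your cluster
fsdp_config = {
    "sharding_strategy": "FULL_SHARD",   # shard parameters, gradients, and optimizer state
    "mixed_precision": "PURE",           # keep params, grads, and buffers in low precision
    "activation_checkpointing": True,    # trade compute for memory on larger models
    "limit_all_gathers": True,
}
```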
Launching a training job with FSDP is as simple as passing this config to the trainer.
We can also specify a configuration for torch.compile:
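For instance (a minimal sketch; the keys are forwarded to torch.compile, so any of its arguments can go here):

```python
# Hypothetical torch.compile settings passed through to the trainer
compile_config = {
    "mode": "default",  # alternatives include "reduce-overhead" and "max-autotune"
}
```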
To enable checkpointing, we need to pass an MLFlowLogger with the option model_registry_uri="databricks", as well as passing kwargs `save_interval` and `save_folder` to the trainer. For this to work, your code must be wrapped in an `if __name__ == "__main__"` block. I recommend setting:
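One possible choice, assuming you want checkpoints to land in the MLflow run's artifact store (the folder template is an assumption; a Unity Catalog volume or local path also works):

```python
save_interval = "2000ba"  # checkpoint every 2000 batches; "1ep" would mean every epoch
save_folder = "dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts/checkpoints"
```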
The variable save_interval should be a string with an integer followed by ‘ba’ or ‘ep’ for batches or epochs, respectively. For example, "2000ba" corresponds to checkpointing every 2000 batches.
If you will only use the final checkpoint, I recommend not setting `save_interval` or `save_folder`, to speed up training.
Gathering everything we have seen so far, creating a trainer is as simple as:
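A sketch that pulls the earlier pieces together (the values shown are placeholders):

```python
from composer import Trainer

trainer = Trainer(
    model=composer_model,
    train_dataloader=train_dataloader,
    optimizers=optimizer,
    schedulers=scheduler,
    max_duration="1ep",                   # or a batch count such as "100000ba"
    device="gpu",
    device_train_microbatch_size="auto",  # let Composer find the largest microbatch that fits
    fsdp_config=fsdp_config,
    # compile_config=compile_config,      # optional: enable torch.compile
    loggers=[mlflow_logger],
    save_interval=save_interval,
    save_folder=save_folder,
)
```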
Then, `trainer.fit()` launches the training run.
The whole script is:
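Below is a condensed sketch of a complete train.py combining the snippets above; the model name, paths, and hyperparameters are placeholders to adapt:

```python
# train.py -- condensed sketch; `gpt2`, the volume path, and hyperparameters are placeholders
import os

import torch
import torch.distributed as torch_dist
import yaml
from composer import Trainer
from composer.loggers import MLFlowLogger
from composer.models import HuggingFaceModel
from composer.optim import LinearWithWarmupScheduler
from streaming import StreamingDataset
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer, default_data_collator


def main():
    # Load job parameters from the YAML file attached to the job (see the config section below)
    config = {}
    params_path = os.environ.get("PARAMETERS")
    if params_path:
        with open(params_path) as f:
            config = yaml.safe_load(f) or {}
    batch_size = config.get("batch_size", 8)

    # Distributed setup: NCCL process group plus one GPU per worker process
    if not torch_dist.is_initialized():
        torch_dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    # Model: `gpt2` stands in for your own architecture/checkpoint
    hf_model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    composer_model = HuggingFaceModel(hf_model, tokenizer=tokenizer, use_logits=True)

    # Data: assumes the MDS shards store fixed-length tokenized samples with labels
    dataset = StreamingDataset(
        remote="dbfs:/Volumes/my_catalog/my_schema/my_external_volume/pretraining_data",
        local="/tmp/mds-cache",
        shuffle=True,
        batch_size=batch_size,
    )
    train_dataloader = DataLoader(dataset, batch_size=batch_size,
                                  collate_fn=default_data_collator)

    # Optimizer, scheduler, and MLflow logging
    optimizer = torch.optim.AdamW(composer_model.parameters(),
                                  lr=config.get("learning_rate", 3e-4))
    scheduler = LinearWithWarmupScheduler(t_warmup="1000ba", alpha_f=0.1)
    mlflow_logger = MLFlowLogger(tracking_uri="databricks", model_registry_uri="databricks")

    trainer = Trainer(
        model=composer_model,
        train_dataloader=train_dataloader,
        optimizers=optimizer,
        schedulers=scheduler,
        max_duration=config.get("max_duration", "1ep"),
        device="gpu",
        device_train_microbatch_size="auto",
        fsdp_config={"sharding_strategy": "FULL_SHARD", "mixed_precision": "PURE"},
        loggers=[mlflow_logger],
        save_interval=config.get("save_interval", "2000ba"),
        save_folder="dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/"
                    "{mlflow_run_id}/artifacts/checkpoints",
    )
    trainer.fit()


if __name__ == "__main__":
    main()
```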
You will first need access to a GPU pool. You can find these under the Compute section (left-hand menu) > GPU Pools.
An AI Runtime job requires a shell script to execute.
First, create a simple launch.sh script anywhere in your workspace. I placed mine in the same folder as train.py:
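A minimal sketch (whether you invoke python directly or Composer's `composer` launcher depends on how you want worker processes spawned, so treat the command itself as an assumption):

```bash
#!/bin/bash
# Minimal launch script -- assumes launch.sh sits next to train.py in the workspace
cd "$(dirname "$0")"
composer train.py
```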
A powerful feature you can use is a configuration YAML file. This allows you to define parameters dynamically instead of hardcoding them in your script. To enable this, create a config.yaml file—this can be empty for now. A sample configuration might look like this:
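For instance (all keys here are hypothetical; add whatever your script reads):

```yaml
# config.yaml -- hypothetical hyperparameters read by train.py
batch_size: 60
learning_rate: 0.0003
max_duration: 1ep
save_interval: 2000ba
```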
The location of this YAML file will be stored in the PARAMETERS environment variable. You can then load the configuration in your training script like this:
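A minimal sketch, assuming the YAML parses to a flat dictionary:

```python
import os
import yaml

# PARAMETERS holds the path to the YAML file attached to the job
params_path = os.environ.get("PARAMETERS")
config = {}
if params_path:
    with open(params_path) as f:
        config = yaml.safe_load(f) or {}
```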
Now, instead of hardcoding values, you can reference them dynamically. For example, replace:
batch_size = 60
with:
batch_size = config.get("batch_size", 60) # Defaults to 60 if not found
This approach makes your training workflow more flexible and easier to adjust without touching the code.
Now, let’s create our LLM training job. Go to the ‘Workflows’ section on the left hand menu and create a job.
Fill in the relevant fields for your job, pointing the task at the launch.sh script you created and selecting your GPU pool as the compute.
Once you have created the job, it will appear in the job list overview for your workspace.
From this screen, click on ‘Run now’ to launch your training job. This will queue your job and execute it as soon as your compute is ready. If you notice it doesn’t launch immediately, it could be that your compute is being used for another task. You can see what is being executed in your GPU pool if you head to “Compute > GPU Pools” and find your compute.
We’ll now go over some handy features that AI Runtime offers to monitor your training job.
When we create a job in AI Runtime, an MLflow experiment is automatically created for us, containing information about the training run, including model metrics, system metrics, and artifacts like model checkpoints.
There are two ways we can find our experiment:
Once you are in the MLflow run, there’s a ton of interesting features to read about your run.
Under the ‘Artifacts’ tab, you’ll find a ‘logs’ folder containing log files with the stdout and stderr from both the training script and all the GPU workers. In the files named “logs-n.chunk.txt” you should see all the print statements from your Python script, as well as the tracebacks should an exception stop the execution.
If you have passed a `save_folder` and `save_interval` to the trainer, your model checkpoints should show under the checkpoints folder in this tab. From there, you can copy the path if you wish to use them in a different Databricks script, or you just want to download them locally.
Under ‘System metrics’, AI Runtime automatically logs system metrics from the driver and worker machines, such as GPU Memory usage and power usage. These metrics can be useful to identify out of memory (OOM) errors. The metric “system/gpu_0_utilization_percentage” is incredibly useful when tuning the batch size to maximize GPU utilization.
The ‘Model metrics’ tab automatically logs the loss from the trainer’s training loop, along with other metrics such as time per batch. For the time metrics, I recommend configuring the graphs to use “time” instead of “step” on the x-axis, as the default plots step against step, which is just a straight line. If you have set device_train_microbatch_size to "auto", you will also see a log of the optimal microbatch size that was found.
In this blog post, we showed how to design and execute a tailored LLM training job across a huge GPU cluster with minimal code (<100 lines) and overhead. This enables your engineering team to drive impactful results that keep your business ahead of the competition, at a fraction of the effort it would normally take.
Interested in learning more? Reach out to one of our experts today!
Aimpoint Digital is a market-leading analytics firm at the forefront of solving the most complex business and economic challenges through data and analytical technology. From integrating self-service analytics to implementing AI at scale and modernizing data infrastructure environments, Aimpoint Digital operates across transformative domains to improve the performance of organizations. Connect with our team and get started today.