When I first started using the OSS libraries Composer and LLM Foundry for training Large Language Models, I was surprised to see an abstraction I had not seen before: the ComposerModel. I had heard of PyTorch models and HuggingFace models, but not of Composer models. In fact, the Composer library has several abstractions for interfacing between HuggingFace and Composer. Clearly there were differences between the two. But what exactly were they, and why would they matter? I needed to find out.
Composer is a PyTorch-based framework built for training neural networks as efficiently and flexibly as possible. It does this by abstracting away the complex logic needed to manage distributed training over multiple GPUs (a must for modern neural network training), while still allowing developers to implement custom algorithms and callbacks as needed.
Training neural networks involves managing many things, from the way data is loaded and processed, to how the model is architected, to how the model weights are updated by its optimizers and learning rate schedulers. Add the need to also manage distributed training across multiple GPUs, and the code needed to orchestrate the training process can become confusing and unmanageable. Not to mention, state-of-the-art techniques are continually being developed, and we may want to augment an existing model architecture to use them. Composer streamlines this whole process in a way that is not only easy, but flexible too.
Custom algorithms such as Selective Backprop can speed up model training, while others can improve a model in other ways, such as helping it to generalise. With Composer, these custom algorithms can be inserted at various points in the training loop, such as the data preprocessing step and the model training step.

As an example, using the ColOut technique, we can choose to drop random rows and columns from image data before the images are loaded into the training loop (example from the docs). This speeds up training because it reduces the size of the images passed to the model, and it can also act as a form of data augmentation by increasing the variability of the data the model sees.
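As a rough sketch of how this looks in code (assuming an image-classification ComposerModel and a dataloader are already defined; the hyperparameters are illustrative), the algorithm is simply passed to the Trainer, which applies it inside the training loop:

```python
from composer import Trainer
from composer.algorithms import ColOut

# ColOut drops a fraction of rows and columns from each input image.
# p_row / p_col control how many rows / columns are removed.
colout = ColOut(p_row=0.15, p_col=0.15, batch=True)

# model and train_dataloader are assumed to be defined elsewhere
# (a ComposerModel and a standard PyTorch DataLoader respectively).
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='1ep',
    algorithms=[colout],  # Composer inserts ColOut at the data preprocessing step of the loop
)
trainer.fit()
```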
With models trending towards the 7B, 13B and 70B parameter range, we not only have to split datasets across multiple GPUs; we have to do the same with the models as well. Sharding a model across multiple GPUs involves more than just distributing the model weights. We may also need to shard the model's optimizer state and activations, and manage the updates during training in a distributed manner too. It's never a good idea to manage this by hand, and this is where Composer's Trainer object can help: it handles the orchestration of the model training loop for us, leaving us to focus on higher-level decisions such as which learning rate scheduler to use.
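As a sketch of what this looks like (the exact argument has changed across Composer versions: older releases accept an fsdp_config dict directly, newer ones nest the same dict under parallelism_config; model and train_dataloader are assumed to be defined elsewhere):

```python
from composer import Trainer

# Sketch: shard parameters, gradients and optimizer state across GPUs with FSDP.
# Keys follow PyTorch FSDP terminology; the values here are illustrative.
fsdp_config = {
    'sharding_strategy': 'FULL_SHARD',   # shard weights, grads and optimizer state
    'activation_checkpointing': True,    # trade recomputation for activation memory
}

trainer = Trainer(
    model=model,                          # a ComposerModel, assumed defined elsewhere
    train_dataloader=train_dataloader,    # assumed defined elsewhere
    max_duration='1ep',
    fsdp_config=fsdp_config,              # newer versions: parallelism_config={'fsdp': fsdp_config}
)
```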
So where does the ComposerModel abstraction fit into this picture? A ComposerModel acts as an interface to the Composer Trainer class. It wraps up a neural network and allows the Trainer to orchestrate the training loop.
Creating a ComposerModel at minimum means implementing the forward() and loss() methods from the ComposerModel base class. But this is rarely enough, because we usually also want to measure how well our model is performing. For this, we also need to compute metrics, which is done by implementing the eval_forward(), get_metrics() and update_metric() methods.
As a minimal sketch, here is an implementation of the forward() and loss() methods of the ComposerModel class, along the lines of the example in the Composer docs (a torchvision ResNet-18 classifier; the class name and number of classes are just for illustration):
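```python
import torch.nn.functional as F
import torchvision.models
from composer.models import ComposerModel

class ResNet18Classifier(ComposerModel):
    """Wraps a torchvision ResNet-18 so the Composer Trainer can drive it."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.model = torchvision.models.resnet18(num_classes=num_classes)

    def forward(self, batch):
        # The Trainer passes in whatever the dataloader yields; here we assume
        # (inputs, targets) tuples and only need the inputs for the forward pass.
        inputs, _ = batch
        return self.model(inputs)

    def loss(self, outputs, batch):
        # outputs is whatever forward() returned for this batch.
        _, targets = batch
        return F.cross_entropy(outputs, targets)
```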
And here is a sketch of adding metrics to the same model by implementing eval_forward(), get_metrics() and update_metric(), again following the pattern in the docs (the accuracy metric comes from torchmetrics):
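```python
import torch.nn.functional as F
import torchvision.models
from torchmetrics.classification import MulticlassAccuracy
from composer.models import ComposerModel

class ResNet18Classifier(ComposerModel):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.model = torchvision.models.resnet18(num_classes=num_classes)
        # Separate metric objects for train and eval so their running state never mixes.
        self.train_accuracy = MulticlassAccuracy(num_classes=num_classes, average='micro')
        self.val_accuracy = MulticlassAccuracy(num_classes=num_classes, average='micro')

    def forward(self, batch):
        inputs, _ = batch
        return self.model(inputs)

    def loss(self, outputs, batch):
        _, targets = batch
        return F.cross_entropy(outputs, targets)

    def eval_forward(self, batch, outputs=None):
        # During evaluation the Trainer may already have outputs from forward();
        # reuse them if so, otherwise run the model.
        return outputs if outputs is not None else self.forward(batch)

    def get_metrics(self, is_train=False):
        # Tell the Trainer which metrics to track in each phase.
        return {'accuracy': self.train_accuracy if is_train else self.val_accuracy}

    def update_metric(self, batch, outputs, metric):
        _, targets = batch
        metric.update(outputs, targets)
```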
Once the model is passed into the Trainer class, the Trainer in turn manages the model training loop for us, including scaling training across multiple GPUs without us having to write any torch.distributed logic by hand.

We can now see why we need a model abstraction here. The training loop is tightly integrated with the model architecture and data (and any customisations we make to them), so it makes sense to have two abstractions (ComposerModel and Trainer) that work well together.
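As a sketch of how the two come together in practice (the dataloaders, optimizer settings and training duration are placeholders; train_dataloader and eval_dataloader are assumed to be ordinary PyTorch DataLoaders defined elsewhere):

```python
from composer import Trainer
from composer.optim import DecoupledSGDW, CosineAnnealingScheduler

# ResNet18Classifier is the ComposerModel sketched above.
model = ResNet18Classifier()
optimizer = DecoupledSGDW(model.parameters(), lr=0.1, momentum=0.9)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    optimizers=optimizer,
    schedulers=CosineAnnealingScheduler(),
    max_duration='10ep',   # Composer understands time units such as epochs ('ep') and batches ('ba')
    device='gpu',
)
trainer.fit()  # the Trainer owns the loop: forward, loss, backward, optimizer step, eval, checkpointing
```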
But for large language models focused on text, many OSS models are published on HuggingFace, and after training our own model we may also want to publish it there. Hence, to interface between Composer and HuggingFace, we can use hf_causal_lm from LLM Foundry to wrap a HuggingFace model as a Composer model and train it using the standard Trainer workflow. If using Mosaic AI MCT, the corresponding mcli training configuration contains a model section along these lines (a sketch; the checkpoint shown is illustrative):
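```yaml
# Sketch of the model section of an LLM Foundry training config
# (the checkpoint name is illustrative; other fields depend on your setup).
model:
  name: hf_causal_lm                                  # wrap a HuggingFace causal LM as a ComposerModel
  pretrained: true
  pretrained_model_name_or_path: meta-llama/Llama-2-7b-hf
```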
Going the other way, we can export a Composer checkpoint in HuggingFace format using the hf_checkpointer callback in LLM Foundry, configured along these lines (a sketch; exact field names may vary between LLM Foundry versions):
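```yaml
# Sketch of enabling the hf_checkpointer callback in an LLM Foundry training config
# (field names may differ between versions; the save location is illustrative).
callbacks:
  hf_checkpointer:
    save_folder: s3://my-bucket/hf-checkpoints   # where the HuggingFace-format checkpoints are written
    save_interval: 1ep                           # export at the end of every epoch
    precision: bfloat16
```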
My introduction to the ComposerModel class was via Large Language Models, but it's worth keeping in mind that Composer is a general-purpose framework for neural networks: it can also be used to train vision models, embedding models, and diffusion models. And training deep neural networks, especially large language models, can increase dramatically in difficulty with model size. This is where other parts of the Mosaic AI stack, such as Streaming Datasets and LLM Foundry, come in. Training large models is not easy, and there are different tools for each part of the job.