jeanne_choo
Databricks Employee

When I first started using the OSS libraries Composer and LLM Foundry for training Large Language Models, I was surprised to see an abstraction I had not seen before: the ComposerModel. I had heard of PyTorch models and HuggingFace models, but not of Composer models. In fact, the Composer library had several abstractions for interfacing between HuggingFace and Composer. Clearly there were differences between the two. But what exactly were they, and why would they be important? I needed to find out.

Background: What is the Composer library?

Composer is a PyTorch-based framework built for training neural networks as efficiently and flexibly as possible. It does this by abstracting away the complex logic needed to manage distributed training over multiple GPUs (a must for modern-day neural network training), while still allowing developers to implement custom algorithms and callbacks as needed.

Background: Why do we need Composer as a framework on top of PyTorch?

Training neural networks involves managing many things, from the way data is loaded and processed, to how the model is architected, to how the model weights are updated by its optimizer and learning rate scheduler. Add in the need to also manage distributed training across multiple GPUs, and the code needed to orchestrate the training process can become confusing and unmanageable. Not to mention, state-of-the-art techniques are continually being developed, and we may want to augment an existing model architecture to use them. Composer streamlines this whole process in a way that is not only easy, but flexible too.

Flexibility

Custom algorithms such as Selective Backprop can speed up the model training process, while others improve the model itself, for example by helping it generalise better. With Composer, these custom algorithms can be inserted at various points in the model training loop, such as the data preprocessing step and the model training step.

As an example, using the ColOut technique, we can choose to drop random rows and columns from image data before they are loaded into the model training loop (example from docs). This helps us speed up training because we reduce the size of the image passed to a model. It can also act as a form of data augmentation by increasing the variability of the data the model sees.
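
To make this concrete, here is a minimal sketch (the hyperparameter values are illustrative, not taken from the docs example): ColOut lives in composer.algorithms as an algorithm object that is later handed to the Trainer.

from composer.algorithms import ColOut

# Drop roughly 15% of the rows and 15% of the columns of each input image
# before it reaches the model (the 0.15 values are illustrative).
colout = ColOut(p_row=0.15, p_col=0.15)

# The algorithm is later passed to the Trainer via `algorithms=[colout]`
# (see the Trainer sketch further below).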

Ease of use

With models trending towards the 7B, 13B and 70B parameter range, we not only have to split datasets across multiple GPUs; we have to shard the models as well. Sharding a model across multiple GPUs involves more than distributing the model weights: we may also need to shard the optimizer state and activations, and manage weight updates during training in a distributed manner. Managing this manually is never a good idea, and this is where Composer's Trainer object can help. It handles the orchestration of the model training loop for us, leaving us to focus on higher-level decisions such as which learning rate scheduler to use.
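
As an illustrative sketch, Composer exposes this through an FSDP configuration passed to the Trainer. The argument name and accepted keys vary between Composer versions (newer releases use a parallelism configuration instead), so treat the snippet below as a rough outline rather than a definitive recipe; composer_model and train_dataloader are assumed to be defined elsewhere.

from composer import Trainer

# Illustrative FSDP settings; key names follow Composer's FSDP docs and
# may differ between Composer versions.
fsdp_config = {
    'sharding_strategy': 'FULL_SHARD',     # shard weights, gradients and optimizer state
    'mixed_precision': 'DEFAULT',
    'activation_checkpointing': True,
}

trainer = Trainer(
    model=composer_model,                  # any ComposerModel (assumed defined elsewhere)
    train_dataloader=train_dataloader,     # assumed defined elsewhere
    max_duration='1ep',
    fsdp_config=fsdp_config,               # Composer orchestrates the sharded training loop
)
trainer.fit()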

Where does the ComposerModel abstraction fit in this picture?

A ComposerModel acts as an interface to the Composer Trainer class. It wraps a neural network and allows the Trainer to orchestrate the training loop.

Creating a ComposerModel means, at a minimum, implementing the forward() and loss() methods from the ComposerModel base class. But this is rarely enough, because we usually also want to measure how well our model is performing. For that we need to compute metrics, which is done by implementing the eval_forward(), get_metrics() and update_metric() methods.

Example of implementing the forward() and loss() methods of the ComposerModel class from the Composer docs:

import torchvision
import torch.nn.functional as F
from composer.models import ComposerModel

class ResNet18(ComposerModel):

    def __init__(self):
        super().__init__()
        self.model = torchvision.models.resnet18()

    def forward(self, batch):  # batch is the output of the dataloader
        # specify how batches are passed through the model
        inputs, _ = batch
        return self.model(inputs)

    def loss(self, outputs, batch):
        # pass batches and `forward` outputs to the loss
        _, targets = batch
        return F.cross_entropy(outputs, targets)

Example of implementing metrics with the ComposerModel class, from the docs:

import torchvision
import torchmetrics
from composer.models import ComposerModel

class ComposerClassifier(ComposerModel):
    def __init__(self):
        super().__init__()
        self.model = torchvision.models.resnet18()
        self.train_accuracy = torchmetrics.classification.MulticlassAccuracy(num_classes=1000, average='micro')
        self.val_accuracy = torchmetrics.classification.MulticlassAccuracy(num_classes=1000, average='micro')

    ...

    def eval_forward(self, batch, outputs=None):
        # reuse the outputs of the forward pass if available,
        # otherwise run the model on the evaluation batch
        if outputs is not None:
            return outputs
        inputs, _ = batch
        return self.model(inputs)

    def update_metric(self, batch, outputs, metric):
        _, targets = batch
        metric.update(outputs, targets)

    def get_metrics(self, is_train=False):
        # defines which metrics to use in each phase of training
        return {'MulticlassAccuracy': self.train_accuracy} if is_train else {'MulticlassAccuracy': self.val_accuracy}

Once the model is passed into the Trainer class, the Trainer, in turn, manages the model training loop in a way that allows us to:

  • implement customised speed-up methods available from the Composer library, for example dropping random rows and columns from image data before they are loaded into the model training loop (example from docs), as shown in the sketch below
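
Putting the pieces together, here is a minimal sketch of training the ResNet18 ComposerModel defined earlier with the ColOut speed-up enabled. The dataset, batch size and hyperparameters are illustrative choices, not the docs' exact example.

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from composer import Trainer
from composer.algorithms import ColOut

# the ResNet18 ComposerModel defined earlier in this post
model = ResNet18()

# illustrative dataset and dataloader; swap in your own data
train_dataset = datasets.CIFAR10('data', train=True, download=True,
                                 transform=transforms.ToTensor())
train_dataloader = DataLoader(train_dataset, batch_size=128, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    optimizers=optimizer,
    max_duration='2ep',                            # train for two epochs
    algorithms=[ColOut(p_row=0.15, p_col=0.15)],   # speed-up method inserted into the loop
)
trainer.fit()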

Interfacing between Composer and HuggingFace for Large Language Models

We can now see why we need a model abstraction here. The training loop is tightly integrated with the model architecture and data (and any customisations we make to them), so it makes sense to have two abstractions (ComposerModel and Trainer) that work well together.

But for large language models focused on text, many OSS models are published on HuggingFace, and after training our own model we may also want to publish it back to HuggingFace. Hence, to interface between Composer and HuggingFace, we can:

  • Use hf_causal_lm from LLM Foundry to wrap a HuggingFace model as a ComposerModel, and train it using the standard Trainer workflow (see the Python sketch after this list). If using Mosaic AI MCT, this is the corresponding mcli configuration field:
model:
    name: hf_causal_lm
    init_device: mixed
    pretrained_model_name_or_path: meta-llama/Llama-2-7b-hf
    pretrained: true
    # Note: you must have set the HF_TOKEN environment variable and have access to the llama2 models
    use_auth_token: true
    use_flash_attention_2: true
  • Save a Composer checkpoint in HuggingFace format using the hf_checkpointer callback in LLM Foundry:
callbacks:
    hf_checkpointer: 
      save_interval: # how often to save a checkpoint. 
      save_folder: # which cloud storage folder to save the checkpoint
      # other arguments
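
Roughly speaking, hf_causal_lm builds on Composer's HuggingFaceModel wrapper. If you are working directly in Python rather than through an mcli YAML, a minimal sketch looks like this (the model name here is an illustrative placeholder):

from transformers import AutoModelForCausalLM, AutoTokenizer
from composer.models import HuggingFaceModel

# illustrative model; any causal LM on the HuggingFace Hub follows the same pattern
hf_model = AutoModelForCausalLM.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2')

# wrap the HuggingFace model so Composer's Trainer can drive the training loop
composer_model = HuggingFaceModel(hf_model, tokenizer=tokenizer, use_logits=True)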


Closing thoughts

My introduction to the ComposerModel class was via Large Language Models, but it's worth keeping in mind that Composer is a general-purpose framework for neural networks: it can also be used to train vision models, embedding models, and diffusion models. Training deep neural networks, especially large language models, gets dramatically harder as model size grows, and this is where other parts of the Mosaic AI stack, such as Streaming Datasets and LLM Foundry, come in. Training large models is not easy, and there are different tools for each part of the job.