Authors: Anastasia Prokaieva and Puneet Jain
In the first part, we covered the main aspects of data loading using the Hugging Face integration with Spark DataFrames, and how to use Ray AIR to distribute fine-tuning of a BERT model for two commonly performed use cases: sequence and token classification.
In this second part we continue to explore Ray AIR and show how to leverage the Databricks Lakehouse AI platform to track your model versions and bring them live with real-time model serving endpoints. This part covers how to:
- tune hyperparameters in a distributed fashion with Ray Tune
- log and register the best model with MLflow
- score the model in batch with Ray
- serve the model with Databricks Model Serving endpoints
Once you have your code running, the default parameters for model training are often not as performant as needed. That is when hyperparameter tuning comes in, and ideally we want to do it in a distributed fashion. The choice of tools for hyperparameter tuning is broad and is not limited to the one used in this article. Since we already use Ray for distributed training, and Ray ships with its own integrated Tuner, we simply stick to the same framework. Our goal here was not to tune the model to the best parameters possible, but if you would like to do so, here is an example:
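A minimal sketch of that tuning step, assuming `trainer` is the HuggingFaceTrainer built in part one (the exact `param_space` keys depend on how your `trainer_init_per_worker` function consumes its config, and APIs vary slightly between Ray versions):

```python
from ray import tune
from ray.tune import Tuner
from ray.tune.schedulers import ASHAScheduler

# Reuse the HuggingFaceTrainer from part one and let Ray Tune
# sample a few training hyperparameters around it.
tuner = Tuner(
    trainer,  # HuggingFaceTrainer from the training section
    param_space={
        "trainer_init_config": {
            "learning_rate": tune.loguniform(1e-5, 5e-5),
            "weight_decay": tune.uniform(0.0, 0.1),
        }
    },
    tune_config=tune.TuneConfig(
        metric="eval_loss",
        mode="min",
        num_samples=4,
        scheduler=ASHAScheduler(),  # stop unpromising trials early
    ),
)
result_grid = tuner.fit()
```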
If you need to check all the results, you can run this:
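For example, the result grid returned by `tuner.fit()` can be inspected as a pandas DataFrame (a sketch; the column names depend on the metrics you report):

```python
# Collect all trial results into a single pandas DataFrame for inspection
results_df = result_grid.get_dataframe()
display(results_df)
```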
To get the best model, use this command:
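A sketch, assuming the same `eval_loss` metric as in the tuning config above:

```python
# Pick the best trial according to the metric used during tuning
best_result = result_grid.get_best_result(metric="eval_loss", mode="min")
best_checkpoint = best_result.checkpoint  # Ray AIR checkpoint of the best trial
print(best_result.config)
```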
Once the model is trained, you will want to log it to track your parameters and keep that particular version of your model. Databricks recently announced a new MLflow integration with some of the most popular frameworks used when working with LLMs, including a new transformers flavor. There are multiple ways to keep track of your model; here we are going to reuse a checkpoint from the best_trial_run we obtained after tuning.
Ray AIR can provide an MLflow logger, but it does not log your models in the proper MLflow format (under the transformers flavor); it simply stores your checkpoints and files as artifacts. We therefore preferred to log the model ourselves from the checkpoint saved on local_disk0 (the driver VM disk, which is wiped when the cluster is stopped; you can also choose to store it under DBFS or Volumes), so that we have a simple way to score the model later while keeping all parameters tracked.
First, let's load our checkpoint back, then log it into the MLflow Tracking server and register the best model:
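A sketch of both steps, assuming the best checkpoint was written to the driver's local disk and the model is a sequence-classification BERT (the checkpoint path and registered model name below are illustrative):

```python
import mlflow
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# 1. Load the fine-tuned weights back from the checkpoint directory on local_disk0
checkpoint_path = "/local_disk0/best_trial_run/checkpoint"  # illustrative path
model = AutoModelForSequenceClassification.from_pretrained(checkpoint_path)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)

# 2. Log the model under the MLflow transformers flavor and register it
with mlflow.start_run() as run:
    mlflow.transformers.log_model(
        transformers_model=pipeline("text-classification", model=model, tokenizer=tokenizer),
        artifact_path="model",
        registered_model_name="bert_sequence_classifier",  # illustrative name
    )
```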
We are going to use the MLflow client to register our model and move its latest version to the Staging stage:
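For instance (the model name is the same illustrative one used above):

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
model_name = "bert_sequence_classifier"  # illustrative registered model name

# Grab the latest registered version and move it to the Staging stage
latest_version = client.get_latest_versions(model_name, stages=["None"])[0].version
client.transition_model_version_stage(
    name=model_name,
    version=latest_version,
    stage="Staging",
)
```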
Soon, models will be managed under Unity Catalog (UC), so instead of stages you would use tags.
Image Description: [MLflow artifact example after the model was logged]
Here is a very simple way to score your model (this would also work if you set the device to "auto" when you have more than one GPU; we have not explored that here):
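For example, loading the registered model back as a transformers pipeline and scoring a small pandas DataFrame (a sketch; the model URI assumes the Staging transition made above):

```python
import mlflow
import pandas as pd

# Load the Staging version of the registered model as a transformers pipeline
loaded_pipeline = mlflow.transformers.load_model("models:/bert_sequence_classifier/Staging")

sample = pd.DataFrame({"text": ["Ray AIR makes distributed fine-tuning simpler."]})
predictions = loaded_pipeline(sample["text"].tolist())
print(predictions)
```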
Once your model is fine-tuned, you would not need to re-train it as often as a classical ML model unless your corpus (input data) evolves significantly. It may also happen that you do not need to fine-tune a model at all, and that an open-source pre-trained model from the Hugging Face Hub fits your use case perfectly.
Hence, the most important step is scoring your model. There are multiple ways to score a model on Databricks. We demonstrated the first one above, using the MLflow transformers flavor; next we demonstrate another way of scoring with Ray, using BatchPredictor and map_batches (you could also write a Pandas UDF leveraging multiple nodes if necessary; BatchPredictor does essentially the same thing).
This is a good way to distribute your scoring across multiple GPU instances (you can also assign more than one GPU per task if you have a very large model that is slow to score):
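A sketch of that approach using Ray Data's `map_batches` with an actor pool of GPU workers (Ray AIR's `BatchPredictor` follows the same pattern; the exact arguments vary with the Ray version, and the checkpoint path and source table are illustrative):

```python
import pandas as pd
import ray
from transformers import pipeline

class HFBatchScorer:
    """Stateful actor that loads the pipeline once and reuses it per batch."""

    def __init__(self):
        self.pipe = pipeline(
            "text-classification",
            model="/local_disk0/best_trial_run/checkpoint",  # illustrative path
            device=0,  # one GPU per actor
        )

    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        preds = self.pipe(batch["text"].tolist(), truncation=True)
        batch["label"] = [p["label"] for p in preds]
        batch["score"] = [p["score"] for p in preds]
        return batch

# Build a Ray dataset from the data to score (illustrative source table)
ds = ray.data.from_spark(spark.table("texts_to_score"))

scored = ds.map_batches(
    HFBatchScorer,
    batch_format="pandas",
    batch_size=64,
    num_gpus=1,  # each actor gets one GPU; raise this for very large models
    compute=ray.data.ActorPoolStrategy(size=4),  # four parallel GPU workers
)
scored.show(5)
```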
You could do the same using pandas_udf or Spark UDFs; here we are just demonstrating another way of doing a similar thing directly with Ray AIR.
This year Databricks announced its new low-latency Model Serving endpoints, which quickly became generally available. Databricks Model Serving is a managed service with automated infrastructure configuration and maintenance, reducing overhead and accelerating ML deployments. Databricks now also offers GPU serving, and optimized serving for LLMs is coming soon. For our small models, CPU or classic GPU serving is more than enough; very large LLMs require optimized serving or multiple GPUs to meet latency requirements.
Databricks Model Serving accepts different MLflow flavors, including the transformers flavor, but that is not the most optimal route at this time. Here we demonstrate how to wrap the model in a PyFunc class, which is one of the most common ways to package and serve models. This hint is also very useful if you wish to chain multiple models: you would wrap them all under a single PyFunc class.
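A minimal sketch of such a wrapper (the class name and the "model_dir" artifact key are illustrative):

```python
import mlflow
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

class TextClassifierWrapper(mlflow.pyfunc.PythonModel):
    """PyFunc wrapper around a fine-tuned transformers pipeline."""

    def load_context(self, context):
        # The checkpoint directory is logged as an artifact named "model_dir"
        model_dir = context.artifacts["model_dir"]
        self.pipe = pipeline(
            "text-classification",
            model=AutoModelForSequenceClassification.from_pretrained(model_dir),
            tokenizer=AutoTokenizer.from_pretrained(model_dir),
        )

    def predict(self, context, model_input):
        # Expect a pandas DataFrame with a "text" column
        return self.pipe(model_input["text"].tolist(), truncation=True)
```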
Once the wrapper is created, we log and register our model again (you could also do only this step and use just that model directly):
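For example (the artifact path, registered model name, and checkpoint path are illustrative):

```python
with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(
        artifact_path="serving_model",
        python_model=TextClassifierWrapper(),
        artifacts={"model_dir": "/local_disk0/best_trial_run/checkpoint"},  # illustrative path
        registered_model_name="bert_sequence_classifier_serving",  # illustrative name
        pip_requirements=["transformers", "torch"],
    )
```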
Once the model is logged and registered, you only need to enable a Serving endpoint on Databricks and send your input data to the endpoint API:
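Querying the endpoint then boils down to a REST call against your workspace (the host, endpoint name, and secret scope below are placeholders):

```python
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder host
ENDPOINT_NAME = "bert-sequence-classifier"                         # placeholder endpoint name
TOKEN = dbutils.secrets.get("my_scope", "serving_token")           # placeholder secret

response = requests.post(
    f"{DATABRICKS_HOST}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"dataframe_split": {"columns": ["text"], "data": [["Ray AIR on Databricks is great."]]}},
)
print(response.json())
```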
Also stay tuned, because a few updates are coming very soon, such as inference tables or optimized GPU serving.
In this blog, we have demonstrated how to fine-tune, score, and serve relatively small LLMs on the Databricks Lakehouse platform. We used Ray AIR as a helper framework that acts as a wrapper around various frameworks and is orchestrated by Spark to perform distributed training and inference. Because Ray AIR connects and unifies the tools of your choice, it is a straightforward journey to change your model family, preprocess your input dataset, and, of course, adapt the configuration files (including the additional parameters required to fine-tune your model).
If you like the blog and want to try training your models with Ray AIR, check the full code in this [repository].
In the next part of the series, we are going to explore how to fine-tune larger models such as Falcon-7B or Llama 2 13B with techniques like DeepSpeed using Ray AIR, and we will also show how to score your models using a multi-node setup on the Databricks Lakehouse Platform.
Here we would like to discuss some known issues, as well as issues we have encountered while working on various use cases with Ray AIR and Hugging Face.