
Authors: Anastasia Prokaieva and Puneet Jain 

In the first part, we covered the main aspects of data loading using the Hugging Face integration with Spark DataFrames, and how to use Ray AIR to distribute fine-tuning of a BERT model for two commonly performed use cases: sequence and token classification.

In this second part, we continue to explore Ray AIR and show how to leverage the Databricks Lakehouse AI platform to track your model versions and bring them live with real-time model serving endpoints. This part covers how to:

  1. Tune your model using Ray Tune
  2. Track and log your LLM with MLflow
  3. Predict on test data with Ray AIR
  4. Serve your model with a real-time endpoint on Databricks using CPU and GPU!
  5. Review the issues we encountered and the solutions we found

Model Tuning 

Once you manage to get your code running, the default parameters for model training will often not be as performant as needed. That's when we need to tune our hyperparameters, ideally in a distributed fashion. The choice of tools for hyperparameter tuning is quite broad and is not limited to the one used in this article; since we already use Ray for distributed training, and Ray has its own integrated Tuner, we simply stick to the same framework. Our goal here was not to tune the model to the best possible parameters, but if you would like to do so, here is an example of how it could look:
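
(A minimal sketch, assuming `trainer` is the Ray AIR HuggingFaceTrainer built in the first part of this series and that `trainer_init_per_worker` reads the learning rate and weight decay from its config; the search space, metric name, and trial count below are illustrative.)

```python
from ray import tune
from ray.tune import Tuner
from ray.tune.schedulers import ASHAScheduler

# `trainer` is the Ray AIR HuggingFaceTrainer from part one (assumption).
tuner = Tuner(
    trainer,
    # Illustrative search space: these keys must match what
    # trainer_init_per_worker reads from its config.
    param_space={
        "trainer_init_config": {
            "learning_rate": tune.loguniform(1e-5, 5e-5),
            "weight_decay": tune.uniform(0.0, 0.1),
        }
    },
    tune_config=tune.TuneConfig(
        metric="eval_loss",   # metric reported by the HF Trainer (assumption)
        mode="min",
        num_samples=4,        # number of trials
        scheduler=ASHAScheduler(),
    ),
)

results = tuner.fit()
```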

If you need to inspect the results of all trials, you can run this:
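
(For example, the ResultGrid returned by `tuner.fit()` can be converted into a pandas DataFrame; the `eval_loss` column name assumes that metric was reported during training.)

```python
# `results` is the ResultGrid returned by tuner.fit()
results_df = results.get_dataframe()

# Sort by the tuning metric to see the best trials first.
print(results_df.sort_values("eval_loss").head())
```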

To get the best model, use this command:
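
(A sketch of retrieving the best trial and its checkpoint from the same ResultGrid, with the metric name as above.)

```python
# Pick the trial with the lowest evaluation loss.
best_result = results.get_best_result(metric="eval_loss", mode="min")

print(best_result.config)                  # hyperparameters of the best trial
best_checkpoint = best_result.checkpoint   # Ray AIR checkpoint of the best model
```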

MLflow: from Logging to Tracking to Inference

Once the model is trained, you definitely want to log it to track your parameters and keep that particular version of your model. Recently, Databricks announced new MLflow integrations with a few of the most popular frameworks for working with LLMs and added a new Transformers flavor. There are multiple ways to keep track of your model; here we are going to reuse the checkpoint from the best trial run we got after tuning.

Ray AIR offers an MLflow logger, but it does not log your models in the proper MLflow format (under the Transformers flavor); it simply stores your checkpoints and files under the run artifacts. Hence, we preferred to log our model ourselves from the checkpoint saved on local_disk0 (the driver VM's local disk, which is cleaned when the cluster is stopped; you can choose to store it under DBFS or Volumes instead), so that we have a simple way of scoring the model later while keeping all parameters tracked.

First, let's load our checkpoint back and log it into the MLflow Tracking server:
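
(A sketch of both steps, assuming a BERT sequence classification model; the local checkpoint path and the base tokenizer name are illustrative, and registration itself is shown in the next snippet.)

```python
import mlflow
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Materialize the Ray AIR checkpoint of the best trial to a local directory
# and load it back with the regular Hugging Face APIs.
local_path = best_checkpoint.to_directory("/local_disk0/best_checkpoint")
model = AutoModelForSequenceClassification.from_pretrained(local_path)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative base tokenizer

# Log the fine-tuned model under the MLflow Transformers flavor.
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=pipeline("text-classification", model=model, tokenizer=tokenizer),
        artifact_path="model",
    )
```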

We are going to use the MLflow Client to register our model and move the latest version to the Staging stage:
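
(A minimal sketch, assuming the model was logged as above; the registered model name is illustrative.)

```python
import mlflow
from mlflow import MlflowClient

model_name = "bert_sequence_classifier"  # illustrative registered model name

# Register the logged model and move the new version to Staging.
model_version = mlflow.register_model(model_info.model_uri, model_name)

client = MlflowClient()
client.transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Staging",
)
```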

Soon, models will live in Unity Catalog (UC), so instead of Stages you will use Tags.

[Image: MLflow artifact view after the model was logged]

Here is a very simple way to score your model (it also works if you set the device to "auto" when you have more than one GPU; we have not explored that here):
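
(For example, loading the Staging version back as a generic pyfunc model and scoring a small pandas DataFrame; the model name and the "text" input column are the illustrative ones used above.)

```python
import mlflow
import pandas as pd

# Load the Staging version of the registered model as a pyfunc.
loaded_model = mlflow.pyfunc.load_model("models:/bert_sequence_classifier/Staging")

sample = pd.DataFrame(
    {"text": ["Ray AIR on Databricks makes distributed training simpler.",
              "This checkpoint does not look right."]}
)
print(loaded_model.predict(sample))
```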

Batch Scoring with Ray

Once your model is fine-tuned, you will not need to re-train it as often as a classical ML model unless your corpus (input data) evolves significantly. It may also happen that you do not need to fine-tune a model at all, and that an open-source pre-trained model from the Hugging Face Hub fits your use case perfectly.

Hence, the most important step is scoring your model. There are multiple ways to score a model on Databricks. We demonstrated the first one above, using the MLflow Transformers flavor; here we demonstrate another way of scoring with Ray, using BatchPredictor and map_batches (you could also write a Pandas UDF leveraging multiple nodes if necessary; BatchPredictor does essentially the same thing).

This is a perfect way to distribute your scoring across multiple GPU instances (you can assign more than one GPU per task if you have a very large model that is slow to score):
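
(A sketch of distributed scoring with Ray AIR's BatchPredictor, assuming `best_checkpoint` is the checkpoint retrieved earlier and `test_ds` is a ray.data.Dataset with a "text" column; the task name and batch size are illustrative.)

```python
from ray.train.batch_predictor import BatchPredictor
from ray.train.huggingface import HuggingFacePredictor

# Wrap the fine-tuned checkpoint in a distributed batch predictor; extra
# keyword arguments are forwarded to the underlying transformers pipeline.
batch_predictor = BatchPredictor.from_checkpoint(
    best_checkpoint,
    HuggingFacePredictor,
    task="text-classification",
)

# Score the Ray Dataset across the GPU workers of the Ray cluster.
predictions = batch_predictor.predict(
    test_ds,
    batch_size=64,
    num_gpus_per_worker=1,  # increase if a single GPU cannot hold the model
)
predictions.show(5)
```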

You could do the same using pandas_udf or Spark UDFs; here we are just demonstrating another way of doing it directly with Ray AIR.

Real-Time Serving 

This year Databricks announced its new low-latency Model Serving endpoints, which became GA very quickly. Databricks Model Serving is a managed service with automated infrastructure configuration and maintenance, reducing overhead and accelerating your ML deployments. Databricks now also offers GPU serving, and optimized serving for LLMs is coming soon. For our small models, CPU serving or classic GPU serving is more than enough; for very large LLMs, optimized serving or multi-GPU serving is required to meet latency requirements.

Databricks Model Serving accepts different MLflow flavors, including the Transformers flavor, but that is not always the most optimal way. Here we demonstrate how to wrap the model in a PyFunc class, which is one of the most common ways to package and serve models. This hint is also very useful if you wish to chain multiple models: you would wrap them all in a single PyFunc class:
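
(A minimal sketch of such a wrapper; the class name, the "checkpoint" artifact key, and the "text" input column are illustrative.)

```python
import mlflow
import pandas as pd

class TextClassifierWrapper(mlflow.pyfunc.PythonModel):
    """Illustrative pyfunc wrapper around a Hugging Face text-classification pipeline."""

    def load_context(self, context):
        # The fine-tuned checkpoint directory is packaged with the model via
        # the `artifacts` argument at logging time (see the next snippet).
        from transformers import pipeline
        self.pipe = pipeline(
            "text-classification",
            model=context.artifacts["checkpoint"],
            tokenizer=context.artifacts["checkpoint"],
        )

    def predict(self, context, model_input: pd.DataFrame) -> pd.DataFrame:
        # Expect a DataFrame with a "text" column; return one row per input.
        return pd.DataFrame(self.pipe(model_input["text"].tolist()))
```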

Once the wrapper is created, we log and register our model again (you could do only this step and continue using only this model):
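
(For example; the artifact path, pip requirements, and registered model name are illustrative, and `local_path` is the checkpoint directory materialized earlier.)

```python
import mlflow

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="serving_model",
        python_model=TextClassifierWrapper(),
        # Package the fine-tuned weights with the model so the endpoint is self-contained.
        artifacts={"checkpoint": local_path},
        pip_requirements=["transformers", "torch"],
        registered_model_name="bert_sequence_classifier_serving",
    )
```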

Once the model is logged and registered, you just need to enable your serving endpoint on Databricks and pass your input data to the endpoint API:
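
(A sketch of calling the endpoint over REST, assuming the model expects a "text" column; the workspace URL, endpoint name, and token are placeholders for your own values.)

```python
import pandas as pd
import requests

workspace_url = "https://<your-workspace>.cloud.databricks.com"  # placeholder
endpoint_name = "bert-classifier-endpoint"                       # placeholder
token = "<databricks-personal-access-token>"                     # placeholder

# Databricks Model Serving accepts a pandas DataFrame in "split" orientation.
payload = {
    "dataframe_split": pd.DataFrame(
        {"text": ["Real-time serving of a fine-tuned BERT model."]}
    ).to_dict(orient="split")
}

response = requests.post(
    f"{workspace_url}/serving-endpoints/{endpoint_name}/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
print(response.json())
```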

Also, stay tuned, because a few updates are coming very soon, such as inference tables and optimized GPU serving.

Conclusion 

In this blog, we demonstrated how to fine-tune, score, and serve relatively small LLMs on the Databricks Lakehouse platform. We used Ray AIR as a helper framework that acts as a wrapper around various frameworks and is orchestrated by Spark to perform distributed training and inference. Because Ray AIR connects and unifies the tools of your choice, it is a straightforward journey to change your model family, preprocess your input dataset, and, of course, adapt the configuration files (which include the additional parameters required to fine-tune your model).

If you like the blog and want to try training your models with Ray AIR, check out the full code in this [repository].

In the next part of the series, we are going to explore how to fine-tune larger models like Falcon-7B or Llama-2 13B using techniques like DeepSpeed with Ray AIR, and we will also show how to score your models using a multi-node setup on the Databricks Lakehouse Platform.

Known Issues and Solutions

Here we would like to discuss some known issues, as well as issues we encountered, while working on various use cases with Ray AIR and Hugging Face.

  • CUDA OOM (GPU RAM out of memory)
    • Your model cannot fit into the available GPU RAM; verify your cluster size or select more nodes.
  • ValueError: Your setup doesn't support bf16/GPU. You need torch>=1.10, using Ampere GPU with cuda>=11.0
    • Your instance type does not support bf16 precision for training; switch to fp16 instead (see the sketch after this list).
    • Instances that support bf16 (and lower) are A10 and A100; if you are using V100 or T4, use only fp16.
  • insufficient_resources_manager.py:128 -- Ignore this message if the cluster is autoscaling. You asked for 353.0 CPU and 16.0 GPU per trial, but the cluster only has 288.0 CPU and 12.0 GPU.
    • You have asked for the wrong amount of resources while setting up the Ray cluster. Check that your CPU and GPU counts are correct.
    • Sometimes your cluster may not be ready yet, even if the UI states it is ready; after running ray.init(), check the Clusters page in the Ray Dashboard to verify that all resources have been provisioned.
  • Once Ray is enabled and all the CPU/GPU cores are used, all the Spark capacity is reserved as a shadow process and Spark operations can no longer run.
  • The MLflow logger from Ray AIR has to be set to save under local_disk0 and not dbfs/, otherwise you will get cloud provider connection issues.
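
For the bf16/fp16 issue above, the switch is a pair of flags in the Hugging Face TrainingArguments created inside trainer_init_per_worker; a minimal sketch (the output directory, batch size, and epoch count are illustrative):

```python
from transformers import TrainingArguments

# On Ampere GPUs (A10, A100) you can keep bf16; on V100/T4 fall back to fp16.
use_bf16 = False  # set based on the GPU type of your cluster

training_args = TrainingArguments(
    output_dir="/local_disk0/output",
    bf16=use_bf16,
    fp16=not use_bf16,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)
```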