This is the second part of a three-part guide on MLflow in the MLOps Gym series. In Part 1, “Beginners’ Guide to MLflow”, we covered Tracking and Model Registry components. In this article, we will focus on the Evaluate component, which is one of the MLflow tools designed to aid in Large Language Model Operations. The remaining components, AI Gateway and Prompt Engineering UI, will be covered in Part 3.
LLMOps, or Large Language Model Operations, is an emerging area in the field of artificial intelligence that focuses on the deployment, monitoring, management, and scaling of large language models (LLMs) like DBRX, Llama3, GPT-4, BERT, and more. LLMs have shown remarkable capabilities in understanding and generating human-like text, which makes them incredibly valuable across various applications such as chatbots, content creation, and language translation. However, the complexity of these models, coupled with their substantial computational requirements, necessitates specialized operational strategies to utilize them effectively.
Databricks is built with LLMOps in mind by integrating open-source tooling, such as MLflow, to streamline the operational aspects of large language models. Databricks alleviates common problems arising from intensive resource consumption, the need for ongoing training, difficulties with evaluating outputs, and the management of ethical considerations such as bias and fairness. The platform's native support for popular ML frameworks used in LLM development, alongside MLflow's capabilities for experiment tracking, model versioning, and lifecycle management, equips teams to effectively train, deploy, and monitor LLMs at scale. Additionally, Unity Catalog allows for tight governance across all data assets, lineage tracking, feature tables, and much more.
LLMOps represents a specialized subset within the broader domain of MLOps, which is concerned with the end-to-end lifecycle management of machine learning systems. LLMOps focuses specifically on the challenges unique to operationalizing large language models: intensive resource consumption, the need for ongoing training and tuning, the difficulty of evaluating unstructured outputs, and the management of ethical considerations such as bias and fairness.
MLflow has grown to include many tools for working with LLMs, including native flavors for some of the most popular packages: Transformers (HuggingFace), OpenAI, Sentence Transformers, and LangChain.
LangChain Flavor:
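As a minimal sketch, a simple LangChain chain can be logged and reloaded with the native flavor. The prompt, chain, and LLM below are illustrative, and the snippet assumes an OPENAI_API_KEY is set in the environment.

```python
import mlflow
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import OpenAI  # assumes OPENAI_API_KEY is set

# Illustrative chain: the prompt template and LLM are placeholders.
prompt = PromptTemplate(
    input_variables=["question"],
    template="Answer the following question concisely: {question}",
)
chain = LLMChain(llm=OpenAI(), prompt=prompt)

# Log the chain with MLflow's native LangChain flavor.
with mlflow.start_run():
    model_info = mlflow.langchain.log_model(chain, artifact_path="langchain_model")

# Reload the chain as a generic pyfunc model and query it.
loaded = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded.predict([{"question": "What is MLflow?"}]))
```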
Transformers Flavor (HuggingFace):
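Similarly, a Hugging Face pipeline can be logged with the Transformers flavor. The sketch below uses a small text-generation pipeline purely for illustration.

```python
import mlflow
import transformers

# Illustrative pipeline; any supported Hugging Face pipeline can be logged this way.
generator = transformers.pipeline(task="text-generation", model="distilgpt2")

# Log the pipeline with MLflow's native Transformers flavor.
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=generator,
        artifact_path="text_generation_model",
    )

# Reload as a pyfunc model and generate text.
loaded = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded.predict(["MLflow makes it easy to"]))
```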
Complementing these MLflow advancements, Databricks has released a wealth of documentation and blogs, including the Big Book of MLOps, which covers generative AI, RAG, and more.
Now, let’s dig deeper into the concepts mentioned above, starting with LLM evaluation.
Large Language Models have changed the technological landscape immensely, but how do we decide which models to use? How do we determine which prompts work well? And how do we make sure we are using these models correctly and efficiently?
Evaluating LLMs is a continuously evolving field, but MLflow makes it easy to aggregate metrics and even use additional LLMs to judge the outputs of your existing models. With mlflow.evaluate(), we can configure an evaluator to gather key information about our models’ performance.
Let’s take a deeper look.
MLflow has a number of default evaluators for specific model types including question-answering, text-summarization, text-generation, and retrievers. You can invoke these evaluators by calling mlflow.evaluate() and setting the model_type parameter, as seen in the following code snippet.
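A minimal sketch of a question-answering evaluation is shown below; the model URI, questions, and ground truth are placeholders.

```python
import mlflow
import pandas as pd

# Illustrative evaluation data; replace with your own questions and ground truth.
eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Databricks?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the ML lifecycle.",
            "Databricks is a unified data and AI platform.",
        ],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        model="models:/my_qa_model/1",    # placeholder model URI
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",  # selects the default QA evaluator
    )
    print(results.metrics)
```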
Function-based metrics measure the effectiveness of LLMs in NLP tasks by taking into account factors such as toxicity, text quality, readability, and relevance. These metrics assess a model's ability to predict correct outcomes, handle errors, and provide meaningful results in tasks like sentiment analysis, named entity recognition, part-of-speech tagging, machine translation, and language modeling.
By default, MLflow collects the function-based metrics associated with the model_type and surfaces them within your experiment along with the rest of the run information. Common examples include toxicity and readability scores such as flesch_kincaid_grade_level and ari_grade_level, as well as exact_match for question-answering and ROUGE scores for text summarization.
Additionally, you can define custom function-based metrics to log specific measurements within your runs. For example, latency is the time it takes to generate a prediction for a given input. This matters when determining whether a model is fast enough for a given application or whether we need to rethink our model selection, and we can include latency in our evaluation by passing mlflow.metrics.latency() to extra_metrics:
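A sketch of this, reusing the evaluation data and placeholder model URI from the previous snippet:

```python
import mlflow

with mlflow.start_run():
    results = mlflow.evaluate(
        model="models:/my_qa_model/1",    # placeholder model URI
        data=eval_data,                   # same DataFrame as in the previous snippet
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[mlflow.metrics.latency()],  # adds per-row latency to the results
    )
    print(results.metrics)
```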
You can remove the model_type if you want to log only your custom metrics. The MLflow documentation provides more information about the supported evaluation metrics.
An alternative approach to evaluating LLMs is to use an LLM-as-a-judge. With this method, a judge LLM scores your model's outputs against a defined metric. Evaluating LLMs with other LLM judges has become a prominent research topic, and our insights on this topic are presented in the blog post Best Practices for LLM Evaluation of RAG Applications.
MLflow can capture metrics generated by other LLMs acting as judges. Currently, the built-in judge metrics include answer similarity, answer correctness, answer relevance, relevance, and faithfulness.
You can configure these metrics by importing the appropriate metric from mlflow.metrics.genai, associating an LLM judge (here we use the Databricks Foundation Model APIs), and passing the resulting metrics to mlflow.evaluate():
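The sketch below shows one way to do this; the judge endpoint name and model URI are illustrative.

```python
import mlflow
from mlflow.metrics.genai import answer_similarity, answer_correctness

# Judge LLM served through the Databricks Foundation Model APIs (endpoint name is illustrative).
judge_model = "endpoints:/databricks-dbrx-instruct"

similarity_metric = answer_similarity(model=judge_model)
correctness_metric = answer_correctness(model=judge_model)

with mlflow.start_run():
    results = mlflow.evaluate(
        model="models:/my_qa_model/1",    # placeholder model URI
        data=eval_data,                   # same DataFrame as in the earlier snippets
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[similarity_metric, correctness_metric],
    )
    # Per-row scores and justifications from the judge are available in the results table.
    print(results.tables["eval_results_table"])
```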
You can also create custom LLM-as-a-judge metrics within MLflow using mlflow.metrics.genai.make_genai_metric(). To do this, first create an EvaluationExample.
Note: You can access the full notebook examples here.
You then pass this example to make_genai_metric(), and the resulting metric can be used like any other metric within mlflow.evaluate():
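A sketch of a hypothetical "professionalism" metric, with an illustrative graded example, grading prompt, and judge endpoint:

```python
import mlflow
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

# Illustrative graded example for a hypothetical "professionalism" metric.
example = EvaluationExample(
    input="How do I reset my password?",
    output="Yo, just smash that reset button and you're good.",
    score=2,
    justification="The answer is correct but the tone is overly casual.",
)

professionalism = make_genai_metric(
    name="professionalism",
    definition="Measures how formal and appropriate the tone of the response is.",
    grading_prompt="Score from 1 to 5, where 1 is very casual and 5 is highly professional.",
    examples=[example],
    model="endpoints:/databricks-dbrx-instruct",  # illustrative judge endpoint
    greater_is_better=True,
)

with mlflow.start_run():
    results = mlflow.evaluate(
        model="models:/my_qa_model/1",    # placeholder model URI
        data=eval_data,                   # same DataFrame as in the earlier snippets
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[professionalism],
    )
    print(results.metrics)
```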
Next blog in this series: MLOps Gym - IDE vs Notebooks for ML Workloads