This is the second part of a three-part guide on MLflow in the MLOps Gym series. In Part 1, “Beginners’ Guide to MLflow”, we covered Tracking and Model Registry components. In this article, we will focus on the Evaluate component, which is one of the MLflow tools designed to aid in Large Language Model Operations. The remaining components, AI Gateway and Prompt Engineering UI, will be covered in Part 3.
- What is LLMOps and why is it important?
- Evaluation
- Calculating and Collecting Metrics
- LLM-as-a-Judge
- Summary
- Coming up next!
What is LLMOps and why is it important?
LLMOps, or Large Language Model Operations, is an emerging area in the field of artificial intelligence that focuses on the deployment, monitoring, management, and scaling of large language models (LLMs) like DBRX, Llama3, GPT-4, BERT, and more. LLMs have shown remarkable capabilities in understanding and generating human-like text, which makes them incredibly valuable across various applications such as chatbots, content creation, and language translation. However, the complexity of these models, coupled with their substantial computational requirements, necessitates specialized operational strategies to utilize them effectively.
Databricks is built with LLMOps in mind by integrating open-source tooling, such as MLflow, to streamline the operational aspects of large language models. Databricks alleviates common problems arising from intensive resource consumption, the need for ongoing training, difficulties with evaluating outputs, and the management of ethical considerations such as bias and fairness. The platform's native support for popular ML frameworks used in LLM development, alongside MLflow's capabilities for experiment tracking, model versioning, and lifecycle management, equips teams to effectively train, deploy, and monitor LLMs at scale. Additionally, Unity Catalog allows for tight governance across all data assets, lineage tracking, feature tables, and much more.
How is LLMOps different from MLOps?
LLMOps represents a specialized subset within the broader domain of MLOps, which is concerned with the end-to-end lifecycle management of Machine Learning systems. LLMOps focuses specifically on the unique challenges posed by large language models. Here are the top challenges when operationalizing LLMs:
- Size of models and data: LLMs are characterized by their vast number of parameters, enormous data requirements, and significant computational resources for training and inference. It is also difficult to properly track model outputs and evaluate overall performance.
- Working with unstructured data: In contrast to general MLOps, LLMOps deals with intricacies like managing the extensive preprocessing pipelines for text data.
- Different training techniques: Modeling for LLMs differs from traditional machine learning in scale, architecture, generalization, context handling, and reliance on transfer learning. Ensuring data privacy in language models, handling the complexities of transfer learning and fine-tuning on domain-specific information, and dealing with the nuances of natural language understanding and generation are challenges that are specific to LLMOps.
- Responsible AI: Additionally, LLMOps must consider the ethical implications and potential biases within language models, emphasizing the importance of responsible AI practices.
Databricks, MLflow, and GenAI
MLflow has grown to include many tools for working with LLMs, including native flavors for some of the most popular packages. These flavors include Transformers (HuggingFace), OpenAI, Sentence Transformers, and LangChain!
LangChain Flavor:
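Below is a minimal sketch of logging a chain with the LangChain flavor. The prompt, the OpenAI-backed LLM, and the model names are illustrative assumptions; an OpenAI API key (or another compatible endpoint) is required for the call to work.

```python
import mlflow
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import OpenAI  # assumes an OpenAI API key is configured

# Hypothetical summarization chain used purely for illustration
prompt = PromptTemplate(
    input_variables=["text"],
    template="Summarize the following text:\n{text}",
)
chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)

with mlflow.start_run():
    # Log the chain with MLflow's native LangChain flavor
    model_info = mlflow.langchain.log_model(chain, artifact_path="langchain_model")

# Load the logged chain back through the generic pyfunc interface
loaded = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded.predict([{"text": "MLflow streamlines the end-to-end ML lifecycle."}]))
```

Once logged, the chain can be loaded, registered, and served like any other MLflow model.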
Transformers Flavor (HuggingFace):
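And a similar sketch for the Transformers flavor, using a small text-generation pipeline as an assumed example model:

```python
import mlflow
from transformers import pipeline

# Small text-generation pipeline; any Hugging Face pipeline works similarly
generator = pipeline("text-generation", model="distilgpt2")

with mlflow.start_run():
    # Log the pipeline with MLflow's native Transformers flavor
    model_info = mlflow.transformers.log_model(
        transformers_model=generator,
        artifact_path="hf_text_generation",
        input_example="MLflow makes it easy to",
    )

# Load it back and generate text through the pyfunc interface
loaded = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded.predict(["MLflow makes it easy to"]))
```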
Complementing the advancements in MLflow, Databricks has released a breadth of documentation and blogs, including the Big Book of MLOps, which covers generative AI, RAG, and more.
Now, let’s start going deeper into the concepts mentioned already, starting with LLM Evaluation.
Evaluation
Large Language Models have changed the technological landscape immensely, but how do we decide which models to use? Which prompts work well? How do we make sure that we are using models correctly and efficiently?
Evaluating LLMs is a continuously evolving field, but MLflow makes it easy to aggregate metrics and even use additional LLMs to judge the outputs of your existing models. Using mlflow.evaluate(), we can pass specific configurations to our evaluator to gather key information about our models’ performance.
Let’s take a deeper look.
Calculating and Collecting Metrics
MLflow has a number of default evaluators for specific model types including question-answering, text-summarization, text-generation, and retrievers. You can invoke these evaluators by calling mlflow.evaluate() and setting the model_type parameter, as seen in the following code snippet.
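As a rough sketch, the call below evaluates a hypothetical registered question-answering model against a small labeled dataset; the model URI, column names, and data are all assumptions for illustration.

```python
import mlflow
import pandas as pd

# Small labeled evaluation set: questions plus reference (ground-truth) answers
eval_df = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What does the Model Registry do?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the ML lifecycle.",
            "The Model Registry stores, versions, and governs registered models.",
        ],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        model="models:/my_qa_model/1",     # hypothetical registered model URI
        data=eval_df,
        targets="ground_truth",
        model_type="question-answering",   # turns on the default QA metrics
    )

print(results.metrics)
```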
Function-based metrics measure the effectiveness of LLMs in NLP tasks by taking into consideration toxicity, quality of text, readability, relevance, etc. These metrics assess the models' ability to predict correct outcomes, handle errors, and provide meaningful results in tasks like sentiment analysis, named entity recognition, part-of-speech tagging, machine translation, and language modeling.
By default, MLflow will collect function-based metrics associated with the model_type and surface them within your experiment along with the rest of the run information.
Some of the common metrics include the following (a short sketch of reading them back from the evaluation results appears after the list):
- Exact-match - This metric measures the percentage of predictions that exactly match the ground truth, often used in question-answering tasks.
- Toxicity - This metric measures the level of toxicity in a text, often used in content moderation to ensure language models do not generate harmful content.
- ROUGE - This metric stands for Recall-Oriented Understudy for Gisting Evaluation and is used to automatically determine the quality of summaries by comparing them to reference summaries.
- ARI Grade Level - This is a readability metric that estimates the U.S. grade level (1-12) needed to understand a text. It's used to evaluate the readability of generated text, ensuring it's appropriate for the intended audience.
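For instance, assuming `results` holds the EvaluationResult from the earlier question-answering run, the aggregated metrics and the per-row table can be inspected directly. The metric key names shown in the comments are illustrative and may differ between MLflow versions.

```python
# Assuming `results` is the EvaluationResult returned by the earlier
# mlflow.evaluate(..., model_type="question-answering") call.
for name, value in results.metrics.items():
    print(f"{name}: {value}")
# Typical aggregate keys (names may vary across MLflow versions):
#   exact_match/v1, toxicity/v1/mean, ari_grade_level/v1/mean

# Per-row scores are also available as a table for deeper inspection
print(results.tables["eval_results_table"].head())
```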
Function-Based Custom Metrics
Additionally, you can define custom function-based metrics to log specific measurements within your runs. For example, latency is the time it takes to generate a prediction for a given input. Latency matters when determining whether a model is fast enough for a given application or whether we need to rethink our model selection, and we can include it in our evaluation by passing mlflow.metrics.latency() to extra_metrics:
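A minimal sketch, assuming a hypothetical registered summarization model and a tiny illustrative dataset; latency is appended to the default text-summarization metrics via extra_metrics.

```python
import mlflow
import pandas as pd

eval_df = pd.DataFrame(
    {
        "inputs": [
            "MLflow tracks experiments, models, and artifacts across the ML lifecycle."
        ],
        "ground_truth": ["MLflow tracks experiments, models, and artifacts."],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        model="models:/my_summarizer/1",            # hypothetical model URI
        data=eval_df,
        targets="ground_truth",
        model_type="text-summarization",
        extra_metrics=[mlflow.metrics.latency()],   # adds per-row latency to the defaults
    )

print(results.metrics)
```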
You can remove the model_type if you want to log only your custom metrics. The MLflow documentation provides more information about the supported evaluation metrics.
LLM-as-a-Judge
An alternative approach to evaluating LLMs is to utilize an LLM-as-a-judge. This method involves using the output from your model and generating scores based on the defined metric for the LLM judge. Evaluating LLMs using other LLM judges has become a prominent research topic. Our insights on this topic are presented in the blog post Best Practices for LLM Evaluation of RAG Applications.
MLflow has the capability to capture metrics generated by other LLMs. Currently, you can measure:
- answer_similarity - measures how similar the output is to the provided ground_truth
- answer_correctness - measures how factually correct the output is based on the ground_truth
- answer_relevance - measures how relevant the output is to the input prompt alone
- relevance - measures how relevant the output is to both the input and the context
- faithfulness - measures how faithful the output is to the provided context
You can configure these metrics by importing the appropriate metric from mlflow.metrics.genai, associating an LLM judge (we are using the Databricks Foundation Model APIs), and passing the result to mlflow.evaluate():
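The sketch below scores pre-computed outputs with answer_similarity and answer_correctness, using an assumed Databricks Foundation Model APIs endpoint as the judge; the endpoint name, data, and column names are illustrative.

```python
import mlflow
import pandas as pd
from mlflow.metrics.genai import answer_correctness, answer_similarity

# Judge model served through the Databricks Foundation Model APIs
# (the endpoint name below is an assumption; substitute your own)
judge = "endpoints:/databricks-dbrx-instruct"

eval_df = pd.DataFrame(
    {
        "inputs": ["What is MLflow Tracking?"],
        "ground_truth": [
            "MLflow Tracking records parameters, metrics, and artifacts for each run."
        ],
        "predictions": [
            "MLflow Tracking logs a run's parameters, metrics, and artifacts."
        ],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_df,
        targets="ground_truth",
        predictions="predictions",  # score pre-computed model outputs
        extra_metrics=[
            answer_similarity(model=judge),
            answer_correctness(model=judge),
        ],
    )

# Per-row scores and the judge's justifications land in the results table
print(results.tables["eval_results_table"])
```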
Custom GenAI Metrics
You can also create custom LLM-as-a-judge metrics within MLflow using mlflow.metrics.genai.make_genai_metric(). You can do this by first creating an EvaluationExample.
Note: You can access the full notebook examples here.
You then use this example within make_genai_metric(), which can be passed to mlflow.evaluate() like any other metric:
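A sketch of the full flow, with a hypothetical "professionalism" metric, an illustrative example, and an assumed judge endpoint:

```python
import mlflow
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

# A worked example that shows the judge what a particular score looks like
example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing the ML lifecycle.",
    score=4,
    justification="Accurate and concise, but does not mention concrete components.",
)

# Hypothetical custom metric judging how professional the answer's tone is
professionalism = make_genai_metric(
    name="professionalism",
    definition="Measures whether the answer uses a clear, professional tone.",
    grading_prompt="Assign a score from 1 to 5, where 5 is a polished, professional answer.",
    examples=[example],
    model="endpoints:/databricks-dbrx-instruct",  # assumed judge endpoint
    greater_is_better=True,
)

# The custom metric plugs into mlflow.evaluate() like any built-in metric
results = mlflow.evaluate(
    data=eval_df,                 # reusing the evaluation DataFrame from above
    predictions="predictions",
    extra_metrics=[professionalism],
)
print(results.metrics)
```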
Coming up next!
Next blog in this series: MLOps Gym - IDE vs Notebooks for ML Workloads