Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.

This is the second part of a three-part guide on MLflow in the MLOps Gym series. In Part 1, “Beginners’ Guide to MLflow”, we covered Tracking and Model Registry components. In this article, we will focus on the Evaluate component, which is one of the MLflow tools designed to aid in Large Language Model Operations. The remaining components, AI Gateway and Prompt Engineering UI, will be covered in Part 3. 

What is LLMOps and why is it important?

LLMOps, or Large Language Model Operations, is an emerging area in the field of artificial intelligence that focuses on the deployment, monitoring, management, and scaling of large language models (LLMs) like DBRX, Llama3, GPT-4, BERT, and more. LLMs have shown remarkable capabilities in understanding and generating human-like text, which makes them incredibly valuable across various applications such as chatbots, content creation, and language translation. However, the complexity of these models, coupled with their substantial computational requirements, necessitates specialized operational strategies to utilize them effectively.

Databricks is built with LLMOps in mind by integrating open-source tooling, such as MLflow, to streamline the operational aspects of large language models. Databricks alleviates common problems arising from intensive resource consumption, the need for ongoing training, difficulties with evaluating outputs, and the management of ethical considerations such as bias and fairness. The platform's native support for popular ML frameworks used in LLM development, alongside MLflow's capabilities for experiment tracking, model versioning, and lifecycle management, equips teams to effectively train, deploy, and monitor LLMs at scale. Additionally, Unity Catalog allows for tight governance across all data assets, lineage tracking, feature tables, and much more. 

How is LLMOps different from MLOps?

LLMOps represents a specialized subset within the broader domain of MLOps, which is concerned with the end-to-end lifecycle management of Machine Learning systems. LLMOps focuses specifically on the unique challenges posed by large language models. Here are the top challenges when operationalizing LLMs:

  • Size of models and data: LLMs are characterized by their vast number of parameters, enormous data requirements, and significant computational resources for training and inference. It is also difficult to properly track model outputs and evaluate overall performance.

  • Working with unstructured data: In contrast to general MLOps, LLMOps deals with intricacies like managing the extensive preprocessing pipelines for text data.

  • Different training techniques: Modeling for LLMs differs from traditional machine learning in scale, architecture, generalization, context handling, and reliance on transfer learning. Challenges specific to LLMOps include ensuring data privacy in language models, handling the complexities of transfer learning and fine-tuning on domain-specific information, and dealing with the nuances of natural language understanding and generation.

  • Responsible AI: Additionally, LLMOps must consider the ethical implications and potential biases within language models, emphasizing the importance of responsible AI practices.

Databricks, MLflow, and GenAI

MLflow has grown to include many tools for working with LLMs, including native flavors for some of the most popular packages: Transformers (HuggingFace), OpenAI, Sentence Transformers, and LangChain.

LangChain Flavor:

with mlflow.start_run():
    model_info = mlflow.langchain.log_model(chain, "langchain_model")

Transformers Flavor (HuggingFace):

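with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=generation_pipeline,  # hypothetical name; the arguments are elided in the source
        artifact_path="transformers_model",  # hypothetical artifact path
    )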

Complementing the advancements of MLflow, Databricks has released a breadth of documentation and blogs including the Big Book of MLOps which includes information on generative AI, RAG, and more.

Now, let’s start going deeper into the concepts mentioned already, starting with LLM Evaluation. 


Large Language Models have changed the technological landscape immensely, but how do we decide which models to use? How do we determine which prompts are good? How do we make sure that we are using models correctly and efficiently?

Evaluating LLMs is a continuously evolving field, but MLflow makes it easy to aggregate metrics and even use additional LLMs to judge the outputs of your existing models. With mlflow.evaluate(), we can configure the evaluator to gather key information about our models’ performance. 

Let’s take a deeper look. 

Calculating and Collecting Metrics

MLflow has a number of default evaluators for specific model types including question-answering, text-summarization, text-generation, and retrievers. You can invoke these evaluators by calling mlflow.evaluate() and setting the model_type parameter, as seen in the following code snippet. 

results = mlflow.evaluate(model=logged_model.model_uri, model_type="question-answering", data=questions)

Function-based metrics measure the effectiveness of LLMs in NLP tasks by taking into consideration toxicity, quality of text, readability, relevance, etc. These metrics assess the models' ability to predict correct outcomes, handle errors, and provide meaningful results in tasks like sentiment analysis, named entity recognition, part-of-speech tagging, machine translation, and language modeling.

By default, MLflow will collect function-based metrics associated with the model_type and surface them within your experiment along with the rest of the run information:


Some of the common metrics include:

  • Exact-match - This metric measures the percentage of predictions that exactly match the ground truth, often used in question-answering tasks.
  • Toxicity - This metric measures the level of toxicity in a text, often used in content moderation to ensure language models do not generate harmful content.
  • ROUGE - This metric stands for Recall-Oriented Understudy for Gisting Evaluation and is used to automatically determine the quality of summaries by comparing them to reference summaries.
  • ARI Grade Level - This is a readability metric that estimates the U.S. grade level (1-12) needed to understand a text. It's used to evaluate the readability of generated text, ensuring it's appropriate for the intended audience.
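
To make two of the metrics above concrete, here is a minimal pure-Python sketch of exact-match and the Automated Readability Index. It is an illustration only and not MLflow's implementation: the function names are hypothetical, and the character and sentence counting rules are simplifying assumptions.

```python
import re


def exact_match(predictions, targets):
    """Fraction of predictions that are identical to their target string."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(predictions)


def ari_grade_level(text):
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43."""
    words = text.split()
    if not words:
        return 0.0
    # Assumption: only letters and digits count as characters
    chars = sum(ch.isalnum() for ch in text)
    # Assumption: each run of '.', '!' or '?' ends one sentence
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return 4.71 * chars / len(words) + 0.5 * len(words) / sentences - 21.43


preds = ["Paris", "Berlin", "Rome"]
truth = ["Paris", "Bonn", "Rome"]
print(exact_match(preds, truth))
print(ari_grade_level("MLflow tracks experiments. It also evaluates models."))
```

Very short, simple sentences can produce an ARI below 1, which is why the score is usually read as a rough guide rather than a strict grade.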

Function-Based Custom Metrics

You can also define custom function-based metrics to log alongside the defaults. For example, latency is the time it takes to generate a prediction for a given input. It helps determine whether the model is fast enough for a given application or whether we need to rethink our model selection. To include latency in the evaluation, pass mlflow.metrics.latency() to extra_metrics:

results = mlflow.evaluate(model=logged_model.model_uri, model_type="question-answering",
                          data=questions, extra_metrics=[mlflow.metrics.latency()])

You can remove the model_type if you want to log only your custom metrics. The MLflow documentation provides more information about the supported evaluation metrics.
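
As a rough illustration of what a per-input latency metric measures, the sketch below times each prediction with the standard library. It does not show how mlflow.metrics.latency() works internally; measure_latency and fake_model are hypothetical names, and the sleep stands in for a real model call.

```python
import time


def measure_latency(predict, inputs):
    """Return (predictions, per-input latencies in seconds)."""
    predictions, latencies = [], []
    for item in inputs:
        start = time.perf_counter()
        predictions.append(predict(item))
        latencies.append(time.perf_counter() - start)
    return predictions, latencies


def fake_model(question):
    # Hypothetical stand-in for a real model call
    time.sleep(0.01)
    return question.upper()


preds, lat = measure_latency(fake_model, ["what is mlflow?", "what is dbrx?"])
print(preds)
print([round(seconds, 3) for seconds in lat])
```

Timing each input separately, rather than the whole batch, keeps slow outliers visible.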


An alternative approach to evaluating LLMs is to use an LLM-as-a-judge. This method passes the output of your model to a second LLM, the judge, which scores it against a defined metric. Evaluating LLMs using other LLM judges has become a prominent research topic. Our insights on this topic are presented in the blog post Best Practices for LLM Evaluation of RAG Applications.
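
The judge flow described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not MLflow's internal logic: the prompt wording, the reply format, and the names build_judge_prompt, parse_judge_reply, and stub_judge are all hypothetical, and the stub returns a canned reply where a real judge would call an LLM endpoint.

```python
import re


def build_judge_prompt(question, answer, definition):
    """Assemble the grading prompt that is sent to the judge model."""
    return (
        f"Metric definition: {definition}\n"
        f"Input: {question}\n"
        f"Output: {answer}\n"
        "Reply in the form 'score: <1-5>' followed by 'justification: <text>'."
    )


def parse_judge_reply(reply):
    """Extract (score, justification); score is None if it cannot be parsed."""
    score_match = re.search(r"score:\s*([1-5])\b", reply, re.IGNORECASE)
    just_match = re.search(r"justification:\s*(.+)", reply, re.IGNORECASE | re.DOTALL)
    score = int(score_match.group(1)) if score_match else None
    justification = just_match.group(1).strip() if just_match else ""
    return score, justification


def stub_judge(prompt):
    # Hypothetical stand-in: a real judge would be an LLM endpoint call
    return "score: 4\njustification: Accurate but could be more concise."


prompt = build_judge_prompt(
    "What is MLflow?",
    "An open-source ML platform.",
    "How relevant the output is to the input.",
)
score, why = parse_judge_reply(stub_judge(prompt))
print(score, why)
```

Returning None for an unparseable reply lets the caller skip that row instead of recording a misleading score.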

MLflow has the capability to capture metrics generated by other LLMs. Currently, you can measure:

  • answer_similarity - measures how similar output is to provided ground_truth
  • answer_correctness - measures how factually correct the output is based on ground_truth
  • answer_relevance - measures how relevant the output is to the input prompt only
  • relevance - measures how relevant the output is to both the input and context
  • faithfulness - measures how faithful the model is to the context

You can configure these metrics by importing them from mlflow.metrics.genai, associating each with an LLM (we are using the Databricks Foundation Model APIs), and passing them to mlflow.evaluate():

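from mlflow.metrics.genai import EvaluationExample, faithfulness, relevance

# Assumption: faithfulness_metric is not defined in the source; it is built here like relevance_metric
faithfulness_metric = faithfulness(model="endpoints:/databricks-llama-2-70b-chat")
relevance_metric = relevance(model="endpoints:/databricks-llama-2-70b-chat")

results = mlflow.evaluate(
    # Assumption: the model and data arguments are elided in the source; reused from the earlier snippets
    model=logged_model.model_uri,
    data=questions,
    extra_metrics=[faithfulness_metric, relevance_metric, mlflow.metrics.latency()],
    evaluator_config={
        "col_mapping": {
            "inputs": "questions",
            "context": "source_documents",
        }
    },
)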


Output table split into two images for readability:


Custom GenAI Metrics

You can also create custom LLM-as-a-judge metrics within MLflow using mlflow.metrics.genai.make_genai_metric(). You can do this by first creating an EvaluationExample.

Note: You can access the full notebook examples here.

example = EvaluationExample(
    input="What is MLflow?",
    # The output and targets strings are truncated in the source and kept as-is
    output=("MLflow is an open-source platform for managing machine "),
    score=4,  # assumption: inferred from the justification ("for a 5-score")
    justification=(
        "The definition effectively explains what MLflow is, "
        "its purpose, and its developer. It could be more concise for a 5-score."
    ),
    grading_context={
        "targets": (
            "MLflow is an open-source platform for managing "
            "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks"
        )
    },
)

You then use this example within make_genai_metric() which can be used like other metrics within mlflow.evaluate():

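from mlflow.metrics.genai import make_genai_metric
metric = make_genai_metric(
    name="answer_correctness",  # assumption: name inferred from the definition text
    definition=(
        "Answer correctness is evaluated on the accuracy of the provided output based on "
        "the provided targets, which is the ground truth..."
    ),
    grading_prompt=(
        "Answer correctness: Below are the details for different scores: "
        "- Score 1: The output is completely incorrect. It is completely different from "
        "or contradicts the provided targets. "
        "- Score 5: The output is correct. It demonstrates a high degree of accuracy and "
        "semantic similarity to the targets."
    ),
    examples=[example],
    model="endpoints:/databricks-llama-2-70b-chat",  # assumption: same judge endpoint as above
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance", "p90"],
)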
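
The aggregations argument asks for summary statistics over the per-row judge scores. The sketch below shows how a mean, variance, and 90th percentile could be computed in plain Python. It is an illustration with made-up scores; the use of population variance and linear interpolation for the percentile are assumptions, not a statement of how MLflow computes them.

```python
import statistics


def percentile(values, pct):
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def aggregate(scores):
    """Summarize per-row scores the way the aggregations list names them."""
    return {
        "mean": statistics.mean(scores),
        "variance": statistics.pvariance(scores),  # assumption: population variance
        "p90": percentile(scores, 90),
    }


judge_scores = [1, 2, 3, 4, 5]  # hypothetical per-row scores from the judge
print(aggregate(judge_scores))
```

The p90 value is useful next to the mean because a handful of very low or very high scores can hide behind an average.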


Throughout this article, we've delved into how LLMOps focuses on the unique challenges of deploying, monitoring, and managing LLMs such as DBRX and Llama3.

Our exploration covered the native support within MLflow for LLM evaluation and associated metrics so that we can better assess our LLMs’ performance. We discussed the use of default and custom function-based metrics, as well as advanced techniques like the LLM-as-a-judge approach for comprehensive model assessment. Databricks includes all of the tools needed for developing, deploying, and managing your large language models. 

Coming up next!

Next blog in this series: MLOps Gym - IDE vs Notebooks for ML Workloads