
Deploying HuggingFace LLM model with MLflow task llm/v1/chat into Databricks

albert_herrando
New Contributor

Hello,

I am currently trying to deploy a HuggingFace LLM model to Databricks with the MLflow task llm/v1/chat in order to use it as a chat.

I have tried several models, such as TinyLlama/TinyLlama_v1.1 and salamandra-7b-instruct.

However, once deployed, the models act very weirdly:

[Screenshots attached: albert_herrando_0-1747401869742.png, albert_herrando_1-1747401929720.png — examples of the garbled chat responses]

The code that I am using to log the models into Unity Catalog is the following:

%pip install transformers
%pip install torch
%pip install accelerate
%pip install torchvision
dbutils.library.restartPython()

import mlflow
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import transformers
import torch
from huggingface_hub import ModelCard

model_id = "TinyLlama/TinyLlama_v1.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

transformers_model = {"model": model, "tokenizer": tokenizer}

# The signature will be automatically inferred using the input_example by MLflow for llm/v1/chat
input_example = {
    "messages": [
        {
            "role": "user",
            "content": "Hello!"
        }
    ],
    # These are optional parameters for the llm/v1/chat endpoint
    # "temperature": 0.6,
    # "max_tokens": 300
}

# --- Unity Catalog Setup ---
# Make sure the catalog and schema exist in Unity Catalog
uc_catalog = "dts_proves_pre"
uc_schema = "llms"
registered_model_name = f"{uc_catalog}.{uc_schema}.TinyLlama_v1-1"

# Configure MLflow to use Unity Catalog
mlflow.set_registry_uri("databricks-uc")

# Log model
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=transformers_model,
        task="llm/v1/chat",
        model_card=ModelCard.load(model_id),
        artifact_path="TinyLlama_v1.1-model",
        # signature=signature, # The signature will be automatically inferred using the input_example by MLflow for llm/v1/chat
        input_example=input_example,
        registered_model_name=registered_model_name,
        extra_pip_requirements=["transformers", "torch", "torchvision", "accelerate"],
    )


I am encountering this problem with several LLMs from HuggingFace. It seems that there is a mismatch when the prompt is generated or the chat template is not properly applied.

Does anyone know what is happening or how to solve it?

Thank you very much in advance.

1 REPLY

mark_ott
Databricks Employee

Deploying HuggingFace LLM models to Databricks using MLflow’s llm/v1/chat task sometimes results in unexpected chat behaviors, usually due to prompt/template mismatches, model configuration issues, or pipeline setup requirements. Here’s a direct answer and a detailed guide to troubleshoot and resolve this issue.

The llm/v1/chat task expects chat-ready models with compatible prompt/chat templates. Many HuggingFace models, including TinyLlama_v1.1 and salamandra-7b-instruct, may not natively expose a chat template or may require additional setup for chat-style prompting. This often leads to models generating output that seems "weird" or does not behave as expected for chat completion tasks.


Common Causes

  • Missing Chat Template: Not all HuggingFace models come with integrated chat templates. The MLflow llm/v1/chat interface expects the model or its pipeline to handle incoming messages formatted for chat, using user/assistant roles. Without a template, user prompts are not wrapped correctly into the model’s expected input, causing poor or incoherent results (see the diagnostic sketch after this list).

  • Model Configuration Issues: Some models require custom configuration for conversation history, roles, or prompt generation to behave as a chat endpoint.

  • Pipeline Mismatch: The transformers pipeline for text-generation doesn’t automatically apply chat templates like those used by OpenChat, Llama-2, or other instruct-tuned models.

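As a quick diagnostic, you can check whether a tokenizer actually ships a chat template. This is a minimal sketch, assuming a recent transformers version where the chat_template attribute exists; base models such as TinyLlama/TinyLlama_v1.1 typically do not define one:

from transformers import AutoTokenizer

# Sketch: check whether the tokenizer ships a chat template. If it does not,
# llm/v1/chat has nothing to wrap the user/assistant messages with, which is
# a common cause of incoherent chat output.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama_v1.1")
print(getattr(tokenizer, "chat_template", None))  # None means no chat template is defined
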

How to Fix

1. Apply the Correct Chat Template Manually

Check if the model (on its HuggingFace page) documents the expected prompt format for chat. For most instruction/assistant models, you need to wrap user messages like:

# Example for models that expect a system/user/assistant format
prompt = "<|system|>You are a helpful assistant.<|user|>Hello!<|assistant|>"
inputs = tokenizer(prompt, return_tensors='pt')
output = model.generate(**inputs)
response = tokenizer.decode(output[0])

For salamandra-7b-instruct and similar models, check their HuggingFace cards or README for specific templates.
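
If the tokenizer does ship a chat template, it is generally safer to let apply_chat_template build the prompt rather than hand-crafting special tokens. A minimal sketch, reusing the tokenizer and model variables from the snippet in the question but assuming a chat-tuned checkpoint that actually provides a template (some templates also reject a system role, so follow the model card):

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
# apply_chat_template renders the messages with the model's own template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))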

2. Use a Custom Wrapper for MLflow

Override the default behavior by defining a custom model wrapper or using MLflow’s custom pyfunc:

  • Prepare the pipeline that applies the chat template before passing to the model.

  • Log this wrapped handler with MLflow so the correct workflow is used when the model is called through the MLflow chat endpoint.

class ChatModelWrapper:
    def __init__(self, model, tokenizer, template_str):
        self.model = model
        self.tokenizer = tokenizer
        self.template_str = template_str

    def predict(self, messages):
        # Build the prompt from the chat messages and the template
        prompt = self.template_str.format(messages)
        inputs = self.tokenizer(prompt, return_tensors="pt")
        output = self.model.generate(**inputs)
        return self.tokenizer.decode(output[0])
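
One way to log such a wrapper is as a custom pyfunc. A minimal sketch, assuming the base model and tokenizer are first saved to a local directory and that the tokenizer provides a chat template; ChatPyfunc and local_dir are illustrative names, not part of the MLflow API:

import mlflow
import mlflow.pyfunc

class ChatPyfunc(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        # The directory logged under the "model_dir" artifact key is downloaded
        # and made available here.
        model_dir = context.artifacts["model_dir"]
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModelForCausalLM.from_pretrained(model_dir)

    def predict(self, context, model_input):
        # Assumes model_input is a dict like {"messages": [...]}; adapt if your
        # serving endpoint passes a DataFrame instead.
        messages = model_input["messages"]
        # Assumes the tokenizer ships a chat template; otherwise use a hand-built
        # template string as in ChatModelWrapper above.
        prompt = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self.tokenizer(prompt, return_tensors="pt")
        output = self.model.generate(**inputs, max_new_tokens=300)
        return self.tokenizer.decode(output[0], skip_special_tokens=True)

# Hypothetical local path where model.save_pretrained(local_dir) and
# tokenizer.save_pretrained(local_dir) were run beforehand.
local_dir = "/tmp/tinyllama_local"

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="chat-wrapper",
        python_model=ChatPyfunc(),
        artifacts={"model_dir": local_dir},
        input_example={"messages": [{"role": "user", "content": "Hello!"}]},
    )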

3. Choose Models with Native Chat Support

If possible, select models that natively support chat templates, such as:

  • Llama-2-Chat (meta-llama/Llama-2-7b-chat-hf)

  • OpenChat

  • Mixtral and other instruct-tuned models

Their HuggingFace model cards typically document the chat template and prompt structure that llm/v1/chat relies on.


Additional Best Practices

  • Always check your model’s card on HuggingFace for the recommended prompt format and any special instructions for inference or chat usage.

  • Test generation locally before logging to Unity Catalog, and again after logging (see the sanity-check sketch after this list).

  • Confirm that model pipeline parameters (temperature, max tokens) are set appropriately.
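
As a sanity check after logging, you can load the model back as a pyfunc and send one chat request before creating a serving endpoint. A minimal sketch, assuming model_info comes from the mlflow.transformers.log_model call shown in the question; the input format mirrors its input_example:

import mlflow

# Load the just-logged llm/v1/chat model and issue one chat request locally.
loaded = mlflow.pyfunc.load_model(model_info.model_uri)
response = loaded.predict({"messages": [{"role": "user", "content": "Hello!"}]})
print(response)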




If you deploy a model with the MLflow llm/v1/chat task and it does not natively support chat-style prompting, you need to apply the chat template manually or wrap the model so user messages are formatted correctly. For best results, either choose chat-tuned models or add a middleware layer that formats prompts to the model’s requirements before logging and deploying with MLflow.