<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Deploying HuggingFace LLM model with MLflow task llm/v1/chat into Databricks in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/deploying-huggingface-llm-model-with-mlflow-task-llm-v1-chat/m-p/119472#M889</link>
    <description>Question: deploying HuggingFace LLMs (TinyLlama/TinyLlama_v1.1, BSC-LT/salamandra-7b-instruct) to Databricks with the MLflow task llm/v1/chat. Once deployed, the models respond incoherently, suggesting the chat template is not being applied. The full question and code are in the first item below.</description>
    <pubDate>Fri, 16 May 2025 13:31:09 GMT</pubDate>
    <dc:creator>albert_herrando</dc:creator>
    <dc:date>2025-05-16T13:31:09Z</dc:date>
    <item>
      <title>Deploying HuggingFace LLM model with MLflow task llm/v1/chat into Databricks</title>
      <link>https://community.databricks.com/t5/generative-ai/deploying-huggingface-llm-model-with-mlflow-task-llm-v1-chat/m-p/119472#M889</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am currently trying to deploy a HuggingFace LLM to Databricks with the MLflow task llm/v1/chat in order to use it as a chat endpoint.&lt;/P&gt;&lt;P&gt;I have tried several models, such as:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;A href="https://huggingface.co/TinyLlama/TinyLlama_v1.1" target="_blank"&gt;TinyLlama/TinyLlama_v1.1 · Hugging Face&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://huggingface.co/BSC-LT/salamandra-7b-instruct" target="_blank"&gt;BSC-LT/salamandra-7b-instruct · Hugging Face&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;However, once deployed, the models behave very strangely:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="albert_herrando_0-1747401869742.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16908i696A73527321FF3A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="albert_herrando_0-1747401869742.png" alt="albert_herrando_0-1747401869742.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="albert_herrando_1-1747401929720.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16909i48DA01A7A8894F1D/image-size/medium?v=v2&amp;amp;px=400" role="button" title="albert_herrando_1-1747401929720.png" alt="albert_herrando_1-1747401929720.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;The code that I am using to log the models into Unity Catalog is the following:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;%pip install transformers torch accelerate torchvision
dbutils.library.restartPython()

import mlflow
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from huggingface_hub import ModelCard

model_id = "TinyLlama/TinyLlama_v1.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

transformers_model = {"model": model, "tokenizer": tokenizer}

# The signature will be automatically inferred using the input_example by MLflow for llm/v1/chat
input_example = {
    "messages": [
        {
            "role": "user",
            "content": "Hello!"
        }
    ],
    # These are optional parameters for the llm/v1/chat endpoint
    # "temperature": 0.6,
    # "max_tokens": 300
}

# --- Unity Catalog Setup ---
# Make sure the catalog and schema exist in Unity Catalog
uc_catalog = "dts_proves_pre"
uc_schema = "llms"
registered_model_name = f"{uc_catalog}.{uc_schema}.TinyLlama_v1-1"

# Configure MLflow to use Unity Catalog
mlflow.set_registry_uri("databricks-uc")

# Log model
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=transformers_model,
        task="llm/v1/chat",
        model_card=ModelCard.load(model_id),
        artifact_path="TinyLlama_v1.1-model",
        # The signature is inferred automatically from input_example for llm/v1/chat
        input_example=input_example,
        registered_model_name=registered_model_name,
        extra_pip_requirements=["transformers", "torch", "torchvision", "accelerate"],
    )

&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I am encountering this problem with several LLMs from HuggingFace. It seems that there is a mismatch when the prompt is generated or the chat template is not properly applied.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Does anyone know what is happening or how to solve it?&lt;/P&gt;&lt;P&gt;Thank you very much in advance.&lt;/P&gt;</description>
      <pubDate>Fri, 16 May 2025 13:31:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/deploying-huggingface-llm-model-with-mlflow-task-llm-v1-chat/m-p/119472#M889</guid>
      <dc:creator>albert_herrando</dc:creator>
      <dc:date>2025-05-16T13:31:09Z</dc:date>
    </item>
    <item>
      <title>Re: Deploying HuggingFace LLM model with MLflow task llm/v1/chat into Databricks</title>
      <link>https://community.databricks.com/t5/generative-ai/deploying-huggingface-llm-model-with-mlflow-task-llm-v1-chat/m-p/138161#M1355</link>
      <description>&lt;P&gt;Deploying HuggingFace LLM models to Databricks with MLflow’s &lt;CODE&gt;llm/v1/chat&lt;/CODE&gt; task sometimes produces unexpected chat behavior, usually because of prompt/template mismatches, model configuration issues, or pipeline setup requirements. Here is a direct answer and a guide to troubleshooting the issue.&lt;/P&gt;
&lt;P&gt;The &lt;CODE&gt;llm/v1/chat&lt;/CODE&gt; task expects chat-ready models with compatible chat templates. Many HuggingFace models, including &lt;CODE&gt;TinyLlama_v1.1&lt;/CODE&gt; and &lt;CODE&gt;salamandra-7b-instruct&lt;/CODE&gt;, do not natively expose a chat template or need extra setup for chat-style prompting, which often leads to output that looks "weird" and does not behave like a chat completion endpoint.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Common Causes&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Missing Chat Template&lt;/STRONG&gt;: Not all HuggingFace models come with integrated chat templates. The MLflow&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;llm/v1/chat&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;interface expects the model or its pipeline to handle incoming messages formatted for chat, using user/assistant roles. Without a template, user prompts are not wrapped correctly for the model’s expected input, causing poor or incoherent results.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Model Configuration Issues&lt;/STRONG&gt;: Some models require custom configuration for conversation history, roles, or prompt gen to behave as a chat endpoint.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Pipeline Mismatch&lt;/STRONG&gt;: The&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;transformers&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;pipeline for text-generation doesn’t automatically apply chat templates like those used by OpenChat, Llama-2, or other instruct-tuned models.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
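&lt;P&gt;You can see the mismatch locally with a quick experiment (a minimal sketch, not the original poster’s code): a plain text-generation pipeline on a base model simply continues the input text and never interprets chat roles.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Sketch: a base model without a chat template just continues raw text.
from transformers import pipeline

generator = pipeline("text-generation", model="TinyLlama/TinyLlama_v1.1")

# The base model treats "Hello!" as text to continue, not as a user turn
# to answer, which is why the served chat endpoint looks incoherent.
print(generator("Hello!", max_new_tokens=30)[0]["generated_text"])
&lt;/LI-CODE&gt;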
&lt;HR /&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;How to Fix&lt;/H2&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;1.&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Apply the Correct Chat Template Manually&lt;/STRONG&gt;&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Check if the model (on its HuggingFace page) documents the expected prompt format for chat. For most instruction/assistant models, you need to wrap user messages like:&lt;/P&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="translate-y-xs -translate-x-xs bottom-xl mb-xl flex h-0 items-start justify-end md:sticky md:top-[100px]"&gt;
&lt;DIV class="overflow-hidden rounded-full border-subtlest ring-subtlest divide-subtlest bg-base"&gt;
&lt;DIV class="border-subtlest ring-subtlest divide-subtlest bg-subtler"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;&lt;SPAN class="token token"&gt;# Example for tools that expect system/user/assistant format&lt;/SPAN&gt;
prompt &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;"&amp;lt;|system|&amp;gt;You are a helpful assistant.&amp;lt;|user|&amp;gt;Hello!&amp;lt;|assistant|&amp;gt;"&lt;/SPAN&gt;
inputs &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; tokenizer&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;prompt&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; return_tensors&lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;'pt'&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
output &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; model&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;generate&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token operator"&gt;**&lt;/SPAN&gt;inputs&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
response &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; tokenizer&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;decode&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;output&lt;SPAN class="token token punctuation"&gt;[&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;0&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;]&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;For&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;salamandra-7b-instruct&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and similar models, check their HuggingFace cards or README for specific templates.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;2.&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Use a Custom Wrapper for MLflow&lt;/STRONG&gt;&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Override the default behavior by defining a custom model wrapper or using MLflow’s custom pyfunc:&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Prepare the pipeline that applies the chat template before passing to the model.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Log this wrapped handler with MLflow so the correct workflow is used when the model is called through the MLflow chat endpoint.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="translate-y-xs -translate-x-xs bottom-xl mb-xl flex h-0 items-start justify-end md:sticky md:top-[100px]"&gt;
&lt;DIV class="overflow-hidden rounded-full border-subtlest ring-subtlest divide-subtlest bg-base"&gt;
&lt;DIV class="border-subtlest ring-subtlest divide-subtlest bg-subtler"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;&lt;SPAN class="token token"&gt;class&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;ChatModelWrapper&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
    &lt;SPAN class="token token"&gt;def&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;__init__&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;self&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; model&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; tokenizer&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; template_str&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
        self&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;model &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; model
        self&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;tokenizer &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; tokenizer
        self&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;template_str &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; template_str

    &lt;SPAN class="token token"&gt;def&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;predict&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;self&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; messages&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;# Build prompt from messages and template&lt;/SPAN&gt;
        prompt &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; self&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;template_str&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;format&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;messages&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
        inputs &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; self&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;tokenizer&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;prompt&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; return_tensors&lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"pt"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
        output &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; self&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;model&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;generate&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token operator"&gt;**&lt;/SPAN&gt;inputs&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;return&lt;/SPAN&gt; self&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;tokenizer&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;decode&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;output&lt;SPAN class="token token punctuation"&gt;[&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;0&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;]&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
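&lt;P&gt;To log such a wrapper, subclass &lt;CODE&gt;mlflow.pyfunc.PythonModel&lt;/CODE&gt;. A minimal sketch, assuming the same TinyLlama model as in the question and a hypothetical &lt;CODE&gt;&amp;lt;|role|&amp;gt;&lt;/CODE&gt; token format (adjust to whatever the model card specifies):&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import mlflow
import mlflow.pyfunc

class ChatPyfunc(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        from transformers import AutoTokenizer, AutoModelForCausalLM
        model_id = "TinyLlama/TinyLlama_v1.1"  # assumption: same model as the question
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(model_id)

    def predict(self, context, model_input):
        # Assumes the caller passes {"messages": [...]} through to predict
        messages = model_input["messages"]
        # Hypothetical template; real models document their own token format
        prompt = "".join(
            f"&amp;lt;|{m['role']}|&amp;gt;{m['content']}" for m in messages
        ) + "&amp;lt;|assistant|&amp;gt;"
        inputs = self.tokenizer(prompt, return_tensors="pt")
        output = self.model.generate(**inputs, max_new_tokens=128)
        return self.tokenizer.decode(output[0], skip_special_tokens=True)

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="chat-wrapper",
        python_model=ChatPyfunc(),
        pip_requirements=["transformers", "torch", "mlflow"],
    )
&lt;/LI-CODE&gt;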
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;3.&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Choose Models with Native Chat Support&lt;/STRONG&gt;&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;If possible, select models that natively support chat templates, such as:&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Llama-2-Chat (&lt;CODE&gt;meta-llama/Llama-2-7b-chat-hf&lt;/CODE&gt;)&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;OpenChat&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Mixtral and some other Instruct models&lt;BR /&gt;Their HuggingFace cards will confirm compatibility with&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;llm/v1/chat&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and often detail their prompt structure.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
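&lt;P&gt;A quick programmatic check (a sketch; the &lt;CODE&gt;meta-llama&lt;/CODE&gt; repo is gated and requires access approval):&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from transformers import AutoTokenizer

# Tokenizers that ship a chat template expose it as .chat_template;
# base models such as TinyLlama_v1.1 typically return None here.
for model_id in ["TinyLlama/TinyLlama_v1.1", "meta-llama/Llama-2-7b-chat-hf"]:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(model_id, "has chat template:", tokenizer.chat_template is not None)
&lt;/LI-CODE&gt;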
&lt;HR /&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Additional Best Practices&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Always check your model’s card on HuggingFace for the recommended prompt format and any special instructions for inference or chat usage.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Test generation locally before logging to Unity Catalog.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Confirm that model pipeline parameters (temperature, max tokens) are set appropriately.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
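&lt;P&gt;A minimal local smoke test, assuming &lt;CODE&gt;model_info&lt;/CODE&gt; comes from the &lt;CODE&gt;log_model&lt;/CODE&gt; call in the question:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import mlflow

# Load the just-logged model back as a pyfunc and send a chat payload;
# llm/v1/chat models accept the {"messages": [...]} input format.
loaded = mlflow.pyfunc.load_model(model_info.model_uri)
response = loaded.predict({"messages": [{"role": "user", "content": "Hello!"}]})
print(response)
&lt;/LI-CODE&gt;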
&lt;HR /&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;References&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;[HuggingFace Models and Prompt Templates]&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;[Databricks and MLflow Transformer Model Deployment Guide]&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;HR /&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;If you deploy a model to MLflow with&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;llm/v1/chat&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;that does not natively support chat-style prompting, you need to manually apply the chat template or wrap your model so user messages are formatted correctly. For best results, either choose chat-tuned models or add a middleware layer that formats prompts according to the model’s requirements before logging and deploying with MLflow.&lt;/P&gt;</description>
      <pubDate>Fri, 07 Nov 2025 17:00:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/deploying-huggingface-llm-model-with-mlflow-task-llm-v1-chat/m-p/138161#M1355</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-11-07T17:00:29Z</dc:date>
    </item>
  </channel>
</rss>

