Hello @thmonte,
You can define the model signature, including input and output parameters, to ensure that the model can handle the required interactions. This involves specifying inference parameters such as temperature, max_tokens, stop, and other relevant settings. Also make sure that your endpoint is configured with the appropriate provisioned throughput settings to handle the expected load and interactions.
Here's an example:
from mlflow.models import infer_signature
import mlflow
# Define model signature including params
input_example = {"prompt": "What is Machine Learning?"}
inference_config = {
"temperature": 1.0,
"max_new_tokens": 100,
"do_sample": True,
"repetition_penalty": 1.15, # Custom parameter example
}
signature = infer_signature(
    model_input=input_example,
    model_output="Machine Learning is...",
    params=inference_config,
)
# Log the model with its details such as artifacts, pip requirements, and input example
with mlflow.start_run() as run:
    # model and tokenizer are the Hugging Face objects you loaded earlier
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        artifact_path="model",
        task="llm/v1/chat",
        signature=signature,
        input_example=input_example,
        registered_model_name="custom_llm_model",
    )
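Once the model is logged with params in its signature, you can override those defaults at inference time by passing params to predict. Here's a minimal sketch, assuming the model above was registered as custom_llm_model and that version 1 is the one you want to load (both are placeholders from the example, not fixed values):

import mlflow

# Load the registered model back as a pyfunc
# (model URI assumes version 1 of the registered model from the example above)
loaded_model = mlflow.pyfunc.load_model("models:/custom_llm_model/1")

# Override the default inference params captured in the signature at predict time
response = loaded_model.predict(
    {"prompt": "What is Machine Learning?"},
    params={"temperature": 0.7, "max_new_tokens": 200},
)
print(response)

Any parameter not passed in params falls back to the default value captured in the signature (the inference_config values above).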