FrancisLaurens
Databricks Employee

Introduction

Multimodal AI models can process multiple types of data simultaneously, such as images, audio, and text. This capability enables more intuitive interactions with AI systems such as chatbots and virtual assistants. This blog post shows how to deploy a multimodal model on Databricks Model Serving as a custom model. Mistral Small-3.1-24B-Instruct-2503 is a good fit here: it offers strong vision understanding (see the benchmarks on Hugging Face) and long-context capabilities, supporting up to 128k tokens without compromising text performance. With 24 billion parameters, the model ranks well in both text and vision tasks. It is also an open-source, multilingual model (under a business-friendly Apache 2.0 license) built in Europe. In this post, you will explore how to use Mistral Small-3.1-24B-Instruct-2503 to generate image descriptions.

Retrieve Mistral Small-3.1-24B-Instruct-2503 from Hugging Face and run inference within a Databricks notebook

All code shared in this post has been tested with DBR 15.4 LTS ML. To run this code and load Mistral Small-3.1-24B-Instruct-2503 into GPU memory, we recommend an instance with at least 50 GB of GPU memory to avoid out-of-memory issues. Suitable options include “g6.12xlarge [L4]” on AWS or “Standard_NC24ads_A100_v4” on Microsoft Azure.
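
If you want to confirm how much GPU memory your cluster exposes before loading the model, a quick optional check with PyTorch (already included in the ML runtime) looks like this:

import torch

# Optional check: list the GPUs visible to this notebook and their total memory
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")
else:
    print("No GPU detected; attach a GPU-enabled instance before loading the model.")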

You will need to install the following Python libraries in your Databricks notebook and restart your Python environment:

%pip install transformers==4.50.0
%pip install mlflow torch==2.5.0 torchvision pillow accelerate==0.31.0

dbutils.library.restartPython()

To start using the Mistral Small-3.1-24B-Instruct-2503 model, begin by creating an account on the Hugging Face website if you don't already have one. Once you have an account, generate a Hugging Face token, which is required to access the model.

Next, visit the Mistral Small-3.1-24B-Instruct-2503 model page on Hugging Face. Since this is a gated model, you will need to request access. After your request is approved, open your Databricks notebook and run the following code to import the required libraries and set your Hugging Face token:

# Standard library
import os
import json
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'  # Reduce TensorFlow logging verbosity

# Third-party libraries
import mlflow
import mlflow.pyfunc
import numpy as np
import pandas as pd
from PIL import Image

# Deep learning/ML libraries
import torch
from huggingface_hub import login
from transformers import AutoProcessor, AutoTokenizer, AutoModelForImageTextToText

# Set Huggingface token to access the Mistral AI gated model (see https://huggingface.co/docs/hub/en/security-tokens)
os.environ["HF_TOKEN"] = 'WRITE_THERE_YOUR_HUGGINGFACE_TOKEN'

To make sure that you have changed the HF_TOKEN variable to your own access token, you can run the following code: 

hf_token = os.getenv("HF_TOKEN")
assert hf_token != 'WRITE_THERE_YOUR_HUGGINGFACE_TOKEN', "Please change the HF_TOKEN to your own access token in the previous cell."

You can then run the following code to load the model: 

# Define model ID for the Mistral AI model
model_id = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"

# Load the processor and tokenizer for the model
processor = AutoProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the Mistral model with specified configurations
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="balanced_low_0", torch_dtype=torch.bfloat16)

# Set the pad token ID for the model's generation configuration
model.generation_config.pad_token_id = tokenizer.pad_token_id

Once the model is loaded, you can run inference directly in your Databricks notebook. With the following code, you ask the model for a short description of an image fetched from a URL:

# Standard library
import base64
import io

# Third-party library
import requests

# Function to convert a PIL image into a base64 string
def pillow_image_to_base64(img):
    buffered = io.BytesIO()
    img.save(buffered, format="JPEG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")

# Function to encode an image file, e.g. an image stored in a Unity Catalog volume
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# URL of the image to be described
url = "https://picsum.photos/id/237/400/300"
image = Image.open(requests.get(url, stream=True).raw)
base64_image = pillow_image_to_base64(image)

# Alternatively, encode an image stored in one of your volumes:
# base64_image = encode_image(image_path_volumes)

input_example = pd.DataFrame(
{"messages": [[
	{
		"role": "user",
		"content": [
			{"type": "text", "text": "What is in this image?"},
			{"type": "image", "url": "data:" + f"image/jpeg;base64,{base64_image}"}
           ]
	},
	{
		"role": "system",
		"content": [
			{"type": "text", "text": "You are a helpful chatbot."}
		]
	}
]]})

messages = input_example["messages"].tolist()

# Process the input and move to the model's device
inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

# Generate the model's response
generated_ids = model.generate(**inputs, max_new_tokens=500)

# Decode the generated tokens to get the output text
decoded_output = processor.batch_decode(generated_ids[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0]
print(decoded_output)

For the text instruction “What is in this image?” with the image retrieved above and “You are a helpful chatbot.” as the system instruction, the model answers: “The image features a black puppy lying on a wooden floor. The puppy is looking directly at the camera with a friendly and curious expression. The puppy has a fluffy coat and appears to be a young dog, likely a Labrador Retriever or a similar breed, given its appearance. The wooden floor has a rustic, weathered look, adding a warm and cozy atmosphere to the image.” You can find more information on how to define the input for this multimodal model in the model documentation.

You can also use this model with text-only input:

input_example = pd.DataFrame(
{"messages": [[
	{
		"role": "user",
		"content": [
			{"type": "text", "text": "What is the capital of France?"}]
	},
	{
		"role": "system",
		"content": [
			{"type": "text", "text": "You are a helpful chatbot."}
		]
	}
]]})

messages = input_example["messages"].tolist()

# Process the input and move to the model's device
inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

# Generate the model's response
generated_ids = model.generate(**inputs, max_new_tokens=500)

# Decode the generated tokens to get the output text
decoded_output = processor.batch_decode(generated_ids[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0]
print(decoded_output)

For the instruction “What is the capital of France?”, the model answers: “The capital of France is Paris.”


Import Mistral Small-3.1-24B-Instruct-2503 into Unity Catalog

In this section, you will see how to import and store this model in Unity Catalog. Unity Catalog gives you centralized access control across Databricks workspaces. To achieve this, we will use the MLflow pyfunc flavor to create a custom class that serves as our model wrapper. pyfunc is the right choice here because it provides a universal interface for integrating models from any machine learning framework into MLflow and lets you encapsulate custom logic within the model itself. By using pyfunc, the model can be deployed seamlessly across the environments supported by MLflow while maintaining a consistent API for interaction, which simplifies deployment and makes the model adaptable to diverse use cases. You can create the custom class with the MLflow pyfunc flavor as follows:

class ChatModelWithImage(mlflow.pyfunc.PythonModel):
	"""
	A custom model that generates text responses to user multi-model queries.
	"""
	def predict(self, context, model_input):
		"""
		Generates text responses to user queries.
		@param context: The context passed to the model.
		@param model_input: The input data for the model.
		"""
		import torch

		messages = None
	
		# 1. If input is already a DataFrame
		if isinstance(model_input, pd.DataFrame):
			if 'messages' in model_input:
				messages = model_input['messages'].tolist()
			else:
				raise KeyError("'messages' column not found in DataFrame.")
		else:
			# 2. Try to convert to DataFrame
			try:
				model_input = pd.DataFrame(model_input)
				if 'messages' in model_input:
					messages = model_input['messages'].tolist()
				else:
					raise KeyError("'messages' column not found after conversion.")
			except Exception:
				# 3. Try to parse as JSON
				try:
					model_input = json.loads(model_input)
					messages = model_input.get('messages')
					if messages is None:
						raise KeyError("'messages' key not found in JSON input.")
				except Exception as e:
					print(f"Failed to parse input: {e}")
					raise ValueError("Input could not be processed as DataFrame or JSON.")
	
		# Process the input and move to the model's device
		inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
	
		# Generate the model's response
		generated_ids = model.generate(**inputs, max_new_tokens=500)

		# Decode the generated tokens to get the output text
		decoded_output = processor.batch_decode(generated_ids[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0]

		# Clear CUDA cache
		torch.cuda.empty_cache()

		return decoded_output

The input-conversion logic in this class is there not only for flexibility, but also because the model serving endpoint you will create later sends its input data as structured JSON.
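
For context, once the model sits behind a serving endpoint, the REST scoring request wraps the same messages structure in one of the standard Databricks serving payload formats, such as dataframe_records. A sketch of such a raw request, where the workspace URL, token, and endpoint name are placeholders:

# Sketch of a raw REST request against the future serving endpoint (URL, token and endpoint name are placeholders)
import requests

payload = {
    "dataframe_records": [
        {"messages": [
            {"role": "user", "content": [{"type": "text", "text": "What is the capital of France?"}]},
            {"role": "system", "content": [{"type": "text", "text": "You are a helpful chatbot."}]}
        ]}
    ]
}

# response = requests.post(
#     "https://<your-workspace-url>/serving-endpoints/<endpoint-name>/invocations",
#     headers={"Authorization": "Bearer <your-databricks-token>"},
#     json=payload,
# )
# print(response.json())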

You can test this Python class in your Databricks notebook via the following code:

# Instantiate the custom chat model
chat_model = ChatModelWithImage()

# Define the input data
input_example = {"messages": [[
	{
		"role": "user",
		"content": [
			{"type": "text", "text": "What is in this image?"},
			{"type": "image", "url":"https://picsum.photos/id/237/400/300"}
           ]
	},
	{
		"role": "system",
		"content": [
			{"type": "text", "text": "You are a helpful chatbot."}
		]
	}
]]}

json_payload = json.dumps(input_example)

# Generate the model's response to the input data
response = chat_model.predict(None, json_payload)

# Print the response from the model
print(response)

It gives a result similar to what you obtained previously. Optionally, you can also try it without an image:

# Define the input data
input_example = pd.DataFrame({"messages": [[
	{
		"role": "user",
		"content": [
			{"type": "text", "text": "What is the capital of France?"}
           ]
	},
	{
		"role": "system",
		"content": [
			{"type": "text", "text": "You are a helpful chatbot."}
		]
	}
]]})

# Generate the model's response to the input data
response = chat_model.predict(None, input_example)

# Print the response from the model
print(response)

Then, you will create a signature to define the model's input and output schemas. It will be used when logging the model in an MLflow run:

# Import modules
from mlflow.models import infer_signature
from mlflow.models import ModelSignature
from mlflow.types.schema import Schema, ColSpec

# Define an input example
input_example = pd.DataFrame({"messages": [[
	{
		"role": "user",
		"content": [
			{"type": "text", "text": "What is in this image?"},
			{"type": "image", "url":"https://picsum.photos/id/237/400/300"}
           ]
	},
	{
		"role": "system",
		"content": [
			{"type": "text", "text": "You are a helpful chatbot."}
		]
	}
]]})

# Infer the model signature from an example input and the corresponding output
response = chat_model.predict(None, input_example)
signature = infer_signature(input_example, response)

It is now time to create an MLflow run that logs the model with this signature:

# Log the model with MLflow
with mlflow.start_run():
	logged_model_info = mlflow.pyfunc.log_model(
		"Mistral-Small-3_1-24B-Instruct-2503",
		python_model=ChatModelWithImage(),
		input_example=input_example,
		signature=signature,
		extra_pip_requirements=["transformers==4.50.0", "torch==2.5.0", "torchvision", "pillow", "accelerate==0.31.0", "pandas", "numpy"],
	)

Based on this MLflow run, you will now register the model in Unity Catalog:

# Update catalog and schema name
catalog_name = "your_catalog"
schema_name = "your_schema"

# Get the run ID of the logged model
run_id = logged_model_info.run_id

# Set the MLflow registry URI to Databricks
mlflow.set_registry_uri("databricks-uc")

# Register the model in the MLflow registry
model_version_obj = mlflow.register_model(
   model_uri=f"runs:/{run_id}/Mistral-Small-3_1-24B-Instruct-2503",
   name=f"{catalog_name}.{schema_name}.Mistral-Small-3_1-24B-Instruct-2503"
)
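
Optionally, you can confirm the registration from the notebook with the MLflow client before moving on; a minimal sketch:

# Optional check: confirm the model version exists in Unity Catalog
from mlflow import MlflowClient

mlflow_client = MlflowClient()
model_version_info = mlflow_client.get_model_version(
    name=f"{catalog_name}.{schema_name}.Mistral-Small-3_1-24B-Instruct-2503",
    version=model_version_obj.version,
)
print(model_version_info.name, model_version_info.version, model_version_info.status)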

Your model is now registered in Unity Catalog, giving you the flexibility to manage access within your Databricks account. You are now ready for the critical deployment phase using Mosaic AI Model Serving, which offers capabilities such as autoscaling to handle varying workloads efficiently, a unified interface for managing all models, including custom-built ones, and an integrated AI Gateway for governance and monitoring. These features make it possible to turn your model into a production-ready service that can handle real-time prediction requests at scale. The next section walks you through deploying the model with these features.

Leverage Mosaic AI Model Serving to deploy Mistral Small-3.1-24B-Instruct-2503 model from Unity Catalog

Model Serving in Databricks is a fully managed service that lets you deploy machine learning models as REST APIs, making them easily accessible for real-time predictions without worrying about the underlying infrastructure or scaling. To learn more, consult the Databricks Model Serving documentation. Now that the model is available in Unity Catalog, it can be deployed with Mosaic AI Model Serving. We recommend at least “MULTIGPU_MEDIUM” on AWS or “GPU Large” on Azure for the workload type, since the model has to fit into GPU memory, and “Small” for the compute scale-out size for testing purposes. Keep in mind that this serving setup is not as optimized as provisioned throughput endpoints: Databricks recommends provisioned throughput for production workloads, as it provides optimized inference for foundation models with performance guarantees.

You could do it through the UI or with the following code: 

# Import module
from mlflow.deployments import get_deploy_client

# Get the deployment client for Databricks
client = get_deploy_client("databricks")

# Create the endpoint configuration
endpoint_config = {
   "name": "Mistral-Small-3_1-24B-Instruct-2503_gpu_medium_x4",
   "config": {
       "served_entities": [
           {
               "entity_name": f"{catalog_name}.{schema_name}.Mistral-Small-3_1-24B-Instruct-2503",
               "entity_version": "1",
               "workload_type": "MULTIGPU_MEDIUM",
               "workload_size": "Small",
               "scale_to_zero_enabled": True
           }
       ],
       "traffic_config": {
           "routes": [
               {
                   "served_model_name": "Mistral-Small-3_1-24B-Instruct-2503-1",
                   "traffic_percentage": 100
               }
           ]
       }
   }
}

# Create the endpoint with the specified configuration
endpoint = client.create_endpoint(config=endpoint_config)

The model serving endpoint will now start building. Once it is up and running, you can navigate to the “Model Serving” section and click the endpoint to get the connection details. The first deployment can take a while, as the serving container needs to be built and the model is quite large.
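
You can also poll the endpoint programmatically while it is building. A small sketch reusing the deployment client created above; the returned object mirrors the Databricks serving REST API response, including a "state" field:

# Check the endpoint status (reuses the MLflow deployment client created earlier)
endpoint_info = client.get_endpoint(endpoint="Mistral-Small-3_1-24B-Instruct-2503_gpu_medium_x4")
print(endpoint_info)
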
You can use the endpoint to run inference with image and text:

# Import module
import pandas as pd
from databricks.sdk import WorkspaceClient

# Initialize the workspace client
workspace_client = WorkspaceClient()

input_example = pd.DataFrame({"messages": [[
{
	"role": "user",
	"content": [
		{"type": "text", "text": "What is in this image?"},
		{"type": "image", "url": "https://picsum.photos/id/237/400/300"}
	]
},
{
	"role": "system",
	"content": [
		{"type": "text", "text": "You are a helpful chatbot."}
	]
}
]]})

response = workspace_client.serving_endpoints.query(
    name="Mistral-Small-3_1-24B-Instruct-2503_gpu_medium_x4",
    dataframe_records=input_example.to_dict(orient='records'))

print(response.predictions)

Optionally, you can use ai_query to run inference through a SQL statement:

%sql

SELECT ai_query('Mistral-Small-3_1-24B-Instruct-2503_gpu_medium_x4',
from_json('{"messages": [
   {"role": "user", "content": [{"type": "text", "text": "What is in this image?"}, {"type": "image", "url": "https://picsum.photos/id/237/400/300"}]},{"role": "system", "content": [{"type": "text", "text": "You are a helpful chatbot."}]}
   ]}',
   'STRUCT<messages: ARRAY<STRUCT<role: STRING, content: ARRAY<STRUCT<type: STRING, text: STRING, url: STRING>>>>>')
) AS prediction

With Mosaic AI Model Serving, deploying this model is simple yet powerful, taking advantage of the integration with the lakehouse for governance and security. Compared with running inference in a Databricks notebook, the endpoint also lets you capture inference payloads in a Delta table and rely on auto-scaling to handle the load.
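
For example, payload logging can be enabled on the endpoint through an auto_capture_config block, which writes requests and responses to an inference table in Unity Catalog. A sketch that reuses the catalog and schema variables defined earlier; the table_name_prefix is an arbitrary example and the field names follow the Databricks serving endpoint API:

# Sketch: extend the endpoint configuration to capture inference payloads in a Delta table
payload_logging_config = {
    "served_entities": endpoint_config["config"]["served_entities"],
    "traffic_config": endpoint_config["config"]["traffic_config"],
    "auto_capture_config": {
        "catalog_name": catalog_name,
        "schema_name": schema_name,
        "table_name_prefix": "mistral_small_3_1",
        "enabled": True
    }
}

# Apply the updated configuration to the existing endpoint
# client.update_endpoint(
#     endpoint="Mistral-Small-3_1-24B-Instruct-2503_gpu_medium_x4",
#     config=payload_logging_config,
# )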

Conclusion

Throughout this blog post, you have retrieved a multimodal model from Mistral AI on Hugging Face, registered it in Unity Catalog, and deployed it with Mosaic AI Model Serving. Databricks promotes a sovereign vision of:

  • Your data
  • Your models
  • Your intelligence platform

Databricks is model-agnostic: end users are free to choose the models that best suit their needs, in a very open way. You can use Mistral models, over which you have far more control, or other open-source models from European startups. This enables you to tackle new use cases on Databricks while respecting the governance, sovereignty, and security practices you need to comply with. This is a theme we will expand on in future blog posts.