LLMs on Databricks are now available to call via LiteLLM. LiteLLM is a library that provides a Python client and an OpenAI-compatible proxy for accessing 100+ LLMs with the same input/output formats, making it easy to manage and switch between models from different providers. This includes both hosted models (OpenAI/Azure/Bedrock, etc.) and self-hosted models (Ollama/vLLM/TGI, etc.). LiteLLM also works across different endpoints—chat, completion, embeddings, image generation, etc.
LiteLLM supports Databricks models available via the Foundation Model APIs, external models, and other chat, completion, and embedding models hosted with Model Serving.
This post will show how to start using LiteLLM with Databricks. We’ll start with a quick discussion of how LiteLLM and Databricks complement each other, then work through a quickstart example of calling models from the Databricks Foundation Model APIs with litellm.completion. After that, we’ll demo the LiteLLM OpenAI Proxy, using it to call models from different providers and to log usage. The post concludes with some examples and links to other ways of using LiteLLM with Databricks models.
Using LiteLLM with Databricks model serving builds on the flexibility offered by both for managing and deploying LLMs. Databricks provides robust MLOps capabilities, scalable inference, production-ready observability features, and support for various open-weights models and proprietary models from providers like Anthropic and OpenAI.
LiteLLM complements these capabilities with a unified API for numerous LLM providers and local or self-hosted LLM platforms, simplifying the process of swapping and testing different models or using local models for testing. It also offers additional features such as cost tracking, error handling, and logging.
LiteLLM’s support for Databricks models enables developers to:
- call Databricks-hosted models through the same interface as 100+ other providers
- swap Databricks models in and out of a project without rewriting client code
- track usage and costs for Databricks and other providers in one place
The LiteLLM Python Client makes it easy to invoke models from different providers via a consistent interface in Python.
First, install LiteLLM into your Python environment with
pip install 'litellm[proxy]'
Next, set up your Databricks model serving credentials:
export DATABRICKS_API_KEY=<your_databricks_PAT>
export DATABRICKS_API_BASE=https://<your_databricks_workspace>/serving-endpoints
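If you’re working in a notebook, you can also set these from Python with os.environ before calling LiteLLM; here is a minimal sketch with placeholder values:

import os

# Same credentials as the export commands above, set from Python instead.
# Replace the placeholders with your own PAT and workspace URL.
os.environ["DATABRICKS_API_KEY"] = "<your_databricks_PAT>"
os.environ["DATABRICKS_API_BASE"] = "https://<your_databricks_workspace>/serving-endpoints"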
With these setup steps completed, we can start invoking Databricks models. The LiteLLM Python client exposes models from any supported provider, including Databricks Model Serving, through the same interface:
from litellm import completion

response = completion(
    model="databricks/databricks-dbrx-instruct",
    messages=[
        {"content": "You are a helpful assistant.", "role": "system"},
        {
            "content": "Is this sentence correct? 'Their are many countries in Europe'",
            "role": "user",
        },
    ],
)
print(response)
Which returns:
ModelResponse(
    id='chatcmpl_fee3bc28-562e-4c67-bf14-5628d6cd348c',
    choices=[
        Choices(
            finish_reason='stop',
            index=0,
            message=Message(
                content='No, the sentence is not correct. The correct form should be "There are many countries in Europe." The word "their" is a possessive pronoun, while "there" is used to indicate a place or to introduce a sentence.',
                role='assistant'
            )
        )
    ],
    created=1720191284,
    model='dbrx-instruct-032724',
    object='chat.completion',
    system_fingerprint=None,
    usage=Usage(prompt_tokens=32, completion_tokens=49, total_tokens=81)
)
You can use the same approach to call self-hosted models and models from other providers, simplifying the process of using multiple models in a project without needing to learn each provider’s specific APIs and clients.
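For example, switching the same request to a hosted OpenAI model only requires changing the model string. The sketch below assumes OPENAI_API_KEY is set in your environment and uses gpt-4o as an example model name:

# Same request routed to a different provider by changing the model string.
# Assumes OPENAI_API_KEY is set; "gpt-4o" is an example OpenAI model name.
response = completion(
    model="gpt-4o",
    messages=[
        {"content": "You are a helpful assistant.", "role": "system"},
        {"content": "Is this sentence correct? 'Their are many countries in Europe'", "role": "user"},
    ],
)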
Now that you have a basic understanding of why it might be useful to use Databricks model serving with LiteLLM and how to get started with the LiteLLM Python client, let’s look into some of the powerful features enabled via the LiteLLM OpenAI Proxy.
Suppose you are working with a team of developers and want to enable them to access models from multiple providers and keep track of usage. The LiteLLM OpenAI Proxy Server allows us to set up an OpenAI-compatible proxy that lets developers call any supported provider using curl requests or the OpenAI Python SDK. The proxy server includes features such as authentication/key management, spend tracking, load balancing, and fallbacks.
In this example, we will use the proxy to give developers access to the Databricks DBRX model and the Anthropic Claude 3.5 Sonnet model, and then log token usage for each.
We’ll configure the proxy server with the following config.yaml file. This configuration will expose the Databricks DBRX model from Databricks Model Serving and Claude 3.5 Sonnet via the Anthropic API.
model_list:
  - model_name: dbrx
    litellm_params:
      model: databricks/databricks-dbrx-instruct
      api_key: os.environ/DATABRICKS_API_KEY
      api_base: os.environ/DATABRICKS_API_BASE
  - model_name: claude
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY
We can then start the server with:
litellm --config config.yaml
And call either of these models with OpenAI-compatible methods. For example, we can now use the OpenAI Python client to call DBRX via the LiteLLM proxy as follows:
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="dbrx",
    messages=[
        {
            "role": "user",
            "content": "Is this sentence correct? 'Their are many countries in Europe'"
        }
    ],
    max_tokens=25,
)
To use Claude instead, all we need to do is change the model name:
response = client.chat.completions.create(model="claude", messages = ...
Everything else stays the same, making it very easy to switch between models. You can also use OpenAI-compatible REST API calls via, for example, curl or the Python requests library.
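As a minimal sketch with the requests library (assuming the proxy is running locally on port 4000 as configured above):

import requests

# The LiteLLM proxy exposes OpenAI-compatible endpoints such as /chat/completions.
response = requests.post(
    "http://0.0.0.0:4000/chat/completions",
    headers={"Authorization": "Bearer anything"},
    json={
        "model": "claude",
        "messages": [
            {"role": "user", "content": "Is this sentence correct? 'Their are many countries in Europe'"}
        ],
        "max_tokens": 25,
    },
)
print(response.json())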
The LiteLLM OpenAI Proxy Server can also log token usage, letting us collect cost and usage details in one place instead of checking the usage dashboards of multiple LLM providers. We can do this by implementing a callback and adding it to our config. To log usage, create a new file called custom_callbacks.py and subclass the litellm.integrations.custom_logger.CustomLogger class:
from litellm.integrations.custom_logger import CustomLogger
import litellm
import logging


class MyCustomHandler(CustomLogger):
    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        try:
            # Init logging config; usage and cost lines are appended to cost.log
            logging.basicConfig(
                filename='cost.log',
                level=logging.INFO,
                format='%(asctime)s - %(message)s',
                datefmt='%Y-%m-%d %H:%M:%S'
            )
            # LiteLLM passes the computed cost of the response in kwargs
            response_cost = kwargs.get("response_cost")
            input_tokens = response_obj.usage.prompt_tokens
            output_tokens = response_obj.usage.completion_tokens
            print("input_tokens", input_tokens, "output_tokens", output_tokens)
            logging.info(f"Model: {response_obj.model} Input Tokens: {input_tokens} Output Tokens: {output_tokens} Response Cost: {response_cost}")
        except Exception as e:
            print(f"Failed to log usage: {e}")


proxy_handler_instance = MyCustomHandler()
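For the proxy to pick up this handler, the callback also needs to be registered in config.yaml. Assuming the file above is saved as custom_callbacks.py next to the config, the registration typically looks like this:

litellm_settings:
  callbacks: custom_callbacks.proxy_handler_instance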
We also need to update the config.yaml file to add the cost per input/output token for DBRX. You can use the Databricks pricing calculator to determine costs depending on your cloud provider and region. We update the DBRX config entry as follows:
  - model_name: dbrx
    litellm_params:
      model: databricks/databricks-dbrx-instruct
      api_key: os.environ/DATABRICKS_API_KEY
      api_base: os.environ/DATABRICKS_API_BASE
      input_cost_per_token: 0.00000075
      output_cost_per_token: 0.00000225
When we call either model via the proxy, the token usage and cost information will be recorded in the cost.log file. If needed, we can aggregate and analyze the log data to get a unified view of usage and costs across providers.
2024-07-16 12:17:55 - Model: dbrx-instruct-032724 Input Tokens: 237 Output Tokens: 47 Response Cost: 0.0002835
2024-07-16 12:18:06 - Model: claude-3-5-sonnet-20240620 Input Tokens: 21 Output Tokens: 145 Response Cost: 0.002238
2024-07-16 12:18:28 - Model: claude-3-5-sonnet-20240620 Input Tokens: 23 Output Tokens: 284 Response Cost: 0.0043289
2024-07-16 12:18:32 - Model: dbrx-instruct-032724 Input Tokens: 238 Output Tokens: 81 Response Cost: 0.00036075
2024-07-16 12:18:55 - Model: dbrx-instruct-032724 Input Tokens: 230 Output Tokens: 37 Response Cost: 0.00025575
2024-07-16 12:22:39 - Model: dbrx-instruct-032724 Input Tokens: 230 Output Tokens: 45 Response Cost: 0.00027375
2024-07-16 12:22:54 - Model: claude-3-5-sonnet-20240620 Input Tokens: 15 Output Tokens: 123 Response Cost: 0.00189
2024-07-16 12:23:28 - Model: claude-3-5-sonnet-20240620 Input Tokens: 18 Output Tokens: 195 Response Cost: 0.002979
2024-07-16 12:23:40 - Model: dbrx-instruct-032724 Input Tokens: 234 Output Tokens: 48 Response Cost: 0.0002835
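As a sketch of that kind of aggregation, a few lines of Python can total the logged cost per model (assuming the log format produced by the handler above):

from collections import defaultdict

# Sum the logged response costs per model from cost.log.
costs = defaultdict(float)
with open("cost.log") as f:
    for line in f:
        if "Response Cost:" in line:
            model = line.split("Model: ")[1].split(" Input Tokens:")[0]
            costs[model] += float(line.rsplit("Response Cost: ", 1)[1])

for model, total in costs.items():
    print(f"{model}: ${total:.6f}")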
For more advanced user- and team-level monitoring, access management, and spend tracking, you can set up a Postgres database and create API keys. There is also a UI for adding users, creating keys, monitoring usage, and more.
The examples above focused on chat completions, but it’s worth noting that LiteLLM supports other types of models as well. For example, we can call the gte-large-en embedding model available via the Databricks Foundation Model APIs using litellm.embedding:
from litellm import embedding

response = embedding(
    model="databricks/databricks-gte-large-en",
    input=["General text embeddings (GTE) can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search. And it also can be used in vector databases for LLMs."],
    instruction="Represent this sentence for searching relevant passages:",
)
print(response)
Which returns:
EmbeddingResponse(
model='gte-large-en-v1.5',
data=[
{
'index': 0,
'object': 'embedding',
'embedding': [
1.0078125,
-0.25537109375,
-0.755859375,
-0.0692138671875,
[...]
1.36328125,
-0.2440185546875,
-0.2159423828125
]
}
],
object='list',
usage=Usage(prompt_tokens=62, total_tokens=62)
)
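The returned vectors can be used directly for downstream tasks. As a small illustration (the sentences and the use of numpy here are just for the example), we can embed two sentences and compare them with cosine similarity:

import numpy as np
from litellm import embedding

# Embed two sentences and compare them with cosine similarity.
texts = ["What countries are in Europe?", "Which nations are part of Europe?"]
response = embedding(model="databricks/databricks-gte-large-en", input=texts)
a, b = (np.array(item["embedding"]) for item in response.data)
similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity: {similarity:.3f}")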
This was a quick introduction to using Databricks model serving with LiteLLM. After reading this guide, you should be able to:
- install LiteLLM and configure it with your Databricks Model Serving credentials
- call chat and embedding models from the Databricks Foundation Model APIs with the LiteLLM Python client
- stand up the LiteLLM OpenAI Proxy to serve Databricks and Anthropic models behind one OpenAI-compatible interface
- log token usage and costs across providers with a custom callback
There is much more you can do with the combination of Databricks model serving and LiteLLM. Here are some ideas:
- configure load balancing and fallbacks across Databricks and other providers in the proxy
- set up a Postgres database, API keys, and the proxy UI for user- and team-level access management and spend tracking
- use the same interface to call external models and other chat, completion, and embedding endpoints hosted with Databricks Model Serving