LLMs on Databricks are now available to call via LiteLLM. LiteLLM is a library that provides a Python client and an OpenAI-compatible proxy for accessing 100+ LLMs with the same input/output formats, making it easy to manage and switch between models from different providers. This includes both hosted models (OpenAI, Azure, Bedrock, etc.) and self-hosted models (Ollama, vLLM, TGI, etc.). LiteLLM also works across different endpoint types: chat, completion, embeddings, image generation, and more.

What Databricks models are available? 

LiteLLM supports models available via the foundation models API, external models, and other chat and embedding models hosted with model serving. This includes chat, completion, and embedding models.

This post shows how to start using LiteLLM with Databricks. We’ll begin with a quick discussion of how LiteLLM and Databricks complement each other, then work through a quickstart example of using litellm.completion to call models from the Databricks Foundation Models API. After that, we’ll demo the LiteLLM OpenAI Proxy, using it to call models from different providers and to log usage. The post concludes with some examples and links to other ways of using LiteLLM with Databricks models.

Why use LiteLLM with Databricks?

Using LiteLLM with Databricks model serving builds on the flexibility offered by both for managing and deploying LLMs. Databricks provides robust MLOps capabilities, scalable inference, production-ready observability features, and support for various open-weights models and proprietary models from providers like Anthropic and OpenAI.

LiteLLM complements these capabilities with a unified API for numerous LLM providers and local or self-hosted LLM platforms, simplifying the process of swapping and testing different models or using local models for testing. It also offers additional features such as cost tracking, error handling, and logging.

LiteLLM’s support for Databricks models enables developers to:

  • Quickly prototype apps across different LLM providers (Databricks/Anthropic/OpenAI/etc.) with a fixed interface
  • Maximize TPM/RPM limits across multiple deployments for the same LLM 
  • Implement fallback mechanisms between Databricks models and other model providers for enhanced reliability (see the sketch after this list)
  • Transition between self-hosted models and models hosted on Databricks Model Serving for simplified local development and testing
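
As an example of the fallback point above, here’s a minimal sketch of what that could look like with LiteLLM’s Router. The model names, fallback mapping, and credentials are illustrative; check the LiteLLM routing docs for the exact parameters supported by your installed version.

import os
from litellm import Router

# Register a Databricks-served model and an Anthropic model under friendly names
router = Router(
    model_list=[
        {
            "model_name": "dbrx",
            "litellm_params": {
                "model": "databricks/databricks-dbrx-instruct",
                "api_key": os.environ["DATABRICKS_API_KEY"],
                "api_base": os.environ["DATABRICKS_API_BASE"],
            },
        },
        {
            "model_name": "claude",
            "litellm_params": {
                "model": "anthropic/claude-3-5-sonnet-20240620",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ],
    # If the DBRX deployment fails, retry the request against Claude
    fallbacks=[{"dbrx": ["claude"]}],
)

response = router.completion(
    model="dbrx",
    messages=[{"role": "user", "content": "Is this sentence correct? 'Their are many countries in Europe'"}],
)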

Quickstart: Using the Python Client

The LiteLLM Python Client makes it easy to invoke models from different providers via a consistent interface in Python.

Installation and Setup

First, install LiteLLM into your Python environment with

pip install 'litellm[proxy]'

Next, set up your Databricks model serving credentials:

export DATABRICKS_API_KEY=<your_databricks_PAT>

export DATABRICKS_API_BASE=https://<your_databricks_workspace>/serving-endpoints
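
If you’re working in a notebook rather than a shell, you can set the same variables from Python instead (a minimal equivalent sketch; substitute your own workspace URL and personal access token):

import os

# Same credentials as the shell exports above, set from Python
os.environ["DATABRICKS_API_KEY"] = "<your_databricks_PAT>"
os.environ["DATABRICKS_API_BASE"] = "https://<your_databricks_workspace>/serving-endpoints"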

With these setup steps completed, we can start invoking Databricks models using LiteLLM. LiteLLM enables us to use models from any supported provider, including Databricks Model Serving, via the same interface using the LiteLLM Python client.

Calling DBRX via Databricks Model Serving

 

from litellm import completion

response = completion(
    model="databricks/databricks-dbrx-instruct",
    messages=[
        {"content": "You are a helpful assistant.", "role": "system"},
        {
            "content": "Is this sentence correct? 'Their are many countries in Europe'",
            "role": "user",
        },
    ],
)

print(response)

Which returns:

 

ModelResponse(
    id='chatcmpl_fee3bc28-562e-4c67-bf14-5628d6cd348c',
    choices=[
        Choices(
            finish_reason='stop',
            index=0,
            message=Message(
                content='No, the sentence is not correct. The correct form should be "There are many countries in Europe." The word "their" is a possessive pronoun, while "there" is used to indicate a place or to introduce a sentence.',
                role='assistant'
            )
        )
    ],
    created=1720191284,
    model='dbrx-instruct-032724',
    object='chat.completion',
    system_fingerprint=None,
    usage=Usage(prompt_tokens=32, completion_tokens=49, total_tokens=81)
)

You can use the same approach to call self-hosted models and models from other providers, simplifying the process of using multiple models in a project without needing to learn each provider’s specific APIs and clients.
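
For example, calling Claude 3.5 Sonnet through the Anthropic API uses the same completion function; only the model string and credentials change (a sketch assuming an ANTHROPIC_API_KEY environment variable is set):

from litellm import completion

# Same interface as the DBRX call above; only the model string differs
response = completion(
    model="anthropic/claude-3-5-sonnet-20240620",
    messages=[
        {"content": "You are a helpful assistant.", "role": "system"},
        {
            "content": "Is this sentence correct? 'Their are many countries in Europe'",
            "role": "user",
        },
    ],
)

print(response)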

Now that you have a basic understanding of why it might be useful to use Databricks model serving with LiteLLM and how to get started with the LiteLLM Python client, let’s look into some of the powerful features enabled via the LiteLLM OpenAI Proxy.

Demo: Monitoring Usage with the LiteLLM Proxy

Suppose you are working with a team of developers and want to enable them to access models from multiple providers and keep track of usage. The LiteLLM OpenAI Proxy Server allows us to set up an OpenAI-compatible proxy that lets developers call any supported provider using curl requests or the OpenAI Python SDK. The proxy server includes features such as authentication/key management, spend tracking, load balancing, and fallbacks.


In this example, we will use the proxy to give developers access to the Databricks DBRX model and the Anthropic Claude 3.5 Sonnet model, and then log their respective token usages.

Configure and Start the Proxy Server

We’ll configure the proxy server with the following config.yaml file. This configuration will expose the Databricks DBRX model from Databricks Model Serving and Claude 3.5 Sonnet via the Anthropic API.

 

model_list:
  - model_name: dbrx
    litellm_params:
      model: databricks/databricks-dbrx-instruct
      api_key: os.environ/DATABRICKS_API_KEY
      api_base: os.environ/DATABRICKS_API_BASE
  - model_name: claude
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY

We can then start the server with:

litellm --config config.yaml

Call Models with the OpenAI Client

We can then call either of these models with OpenAI-compatible methods. For example, we can use the OpenAI Python client to call DBRX via the LiteLLM proxy as follows:

 

import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="dbrx",
    messages=[
        {
            "role": "user",
            "content": "Is this sentence correct? 'Their are many countries in Europe'"
        }
    ],
    max_tokens=25,
)

To use Claude instead, all we need to do is change the model name:

response = client.chat.completions.create(model="claude", messages = ...

Everything else stays the same, making it very easy to switch between models. You can also make OpenAI-compatible REST API calls directly, for example with curl or the Python requests library, as sketched below.
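
Here is a rough requests-based equivalent of the OpenAI client call above (assumes the proxy is running locally on port 4000, as configured earlier):

import requests

# Call the proxy's OpenAI-compatible chat completions endpoint directly
resp = requests.post(
    "http://0.0.0.0:4000/chat/completions",
    headers={"Authorization": "Bearer anything", "Content-Type": "application/json"},
    json={
        "model": "claude",
        "messages": [
            {"role": "user", "content": "Is this sentence correct? 'Their are many countries in Europe'"}
        ],
        "max_tokens": 25,
    },
)

print(resp.json()["choices"][0]["message"]["content"])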

Monitoring Costs and Usage

The LiteLLM OpenAI Proxy Server also lets us log token usage, collecting cost and usage details in one place so we don’t have to check the usage dashboards of multiple LLM providers. We can do this by implementing a callback and registering it in our config. To log usage, create a new file called custom_callbacks.py and subclass the litellm.integrations.custom_logger.CustomLogger class:

 

from litellm.integrations.custom_logger import CustomLogger
import litellm
import logging

class MyCustomHandler(CustomLogger):
    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        try:
            # init logging config
            logging.basicConfig(
                filename='cost.log',
                level=logging.INFO,
                format='%(asctime)s - %(message)s',
                datefmt='%Y-%m-%d %H:%M:%S'
            )
            response_cost = kwargs.get("response_cost")
            input_tokens = response_obj.usage.prompt_tokens
            output_tokens = response_obj.usage.completion_tokens
            print("input_tokens", input_tokens, "output_tokens", output_tokens)
            logging.info(f"Model: {response_obj.model} Input Tokens: {input_tokens} Output Tokens: {output_tokens} Response Cost: {response_cost}")
        except Exception as e:
            print(f"Failed to log usage: {e}")

proxy_handler_instance = MyCustomHandler()
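
For the proxy to actually invoke this handler, it needs to be registered in config.yaml. Assuming custom_callbacks.py lives in the same directory as config.yaml, the registration follows LiteLLM’s custom callback pattern and looks roughly like this:

litellm_settings:
  callbacks: custom_callbacks.proxy_handler_instance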

We also need to update the config.yaml file to add the cost per input/output token for DBRX. You can use the Databricks pricing calculator to determine costs depending on your cloud provider and region. We update the DBRX config entry as follows:

 

  - model_name: dbrx
    litellm_params:
      model: databricks/databricks-dbrx-instruct
      api_key: os.environ/DATABRICKS_API_KEY
      api_base: os.environ/DATABRICKS_API_BASE
      input_cost_per_token: 0.00000075
      output_cost_per_token: 0.00000225

When we call either model via the OpenAI proxy, the token usage and cost information will be recorded in the cost.log file. If needed, we can aggregate and analyze the log data to get a unified view of usage and costs across providers.

 

2024-07-16 12:17:55 - Model: dbrx-instruct-032724 Input Tokens: 237 Output Tokens: 47 Response Cost: 0.0002835
2024-07-16 12:18:06 - Model: claude-3-5-sonnet-20240620 Input Tokens: 21 Output Tokens: 145 Response Cost: 0.002238
2024-07-16 12:18:28 - Model: claude-3-5-sonnet-20240620 Input Tokens: 23 Output Tokens: 284 Response Cost: 0.0043289
2024-07-16 12:18:32 - Model: dbrx-instruct-032724 Input Tokens: 238 Output Tokens: 81 Response Cost: 0.00036075
2024-07-16 12:18:55 - Model: dbrx-instruct-032724 Input Tokens: 230 Output Tokens: 37 Response Cost: 0.00025575
2024-07-16 12:22:39 - Model: dbrx-instruct-032724 Input Tokens: 230 Output Tokens: 45 Response Cost: 0.00027375
2024-07-16 12:22:54 - Model: claude-3-5-sonnet-20240620 Input Tokens: 15 Output Tokens: 123 Response Cost: 0.00189
2024-07-16 12:23:28 - Model: claude-3-5-sonnet-20240620 Input Tokens: 18 Output Tokens: 195 Response Cost: 0.002979
2024-07-16 12:23:40 - Model: dbrx-instruct-032724 Input Tokens: 234 Output Tokens: 48 Response Cost: 0.0002835
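
As a quick illustration of that aggregation step, here’s a minimal sketch that parses the log lines produced by the callback above and totals tokens and cost per model (the regular expression is tied to the exact log format shown here):

import re
from collections import defaultdict

# Matches the log lines written by MyCustomHandler above
pattern = re.compile(
    r"Model: (?P<model>\S+) Input Tokens: (?P<input>\d+) "
    r"Output Tokens: (?P<output>\d+) Response Cost: (?P<cost>[\d.]+)"
)

costs = defaultdict(float)
tokens = defaultdict(int)

with open("cost.log") as f:
    for line in f:
        match = pattern.search(line)
        if match:
            costs[match["model"]] += float(match["cost"])
            tokens[match["model"]] += int(match["input"]) + int(match["output"])

for model, cost in costs.items():
    print(f"{model}: {tokens[model]} tokens, ${cost:.6f}")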

For more advanced user- and team-level monitoring, access management, and spend tracking, you can set up a Postgres database and create API keys. There is also a UI for adding users, creating keys, monitoring usage, and more.

Other model types

The examples above focused on chat completions, but it’s worth noting that LiteLLM supports other types of models as well. For example, we can call the gte-large-en embedding model, available via the Databricks foundation models API, using litellm.embedding:

 

from litellm import embedding

response = embedding(
    model="databricks/databricks-gte-large-en",
    input=["General text embeddings (GTE) can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search. And it also can be used in vector databases for LLMs."],
    instruction="Represent this sentence for searching relevant passages:",
)

print(response)

 

Which returns:

 

EmbeddingResponse(
    model='gte-large-en-v1.5',
    data=[
        {
            'index': 0,
            'object': 'embedding',
            'embedding': [
                1.0078125,
                -0.25537109375,
                -0.755859375,
                -0.0692138671875,
                [...]
                1.36328125,
                -0.2440185546875,
                -0.2159423828125
            ]
        }
    ],
    object='list',
    usage=Usage(prompt_tokens=62, total_tokens=62)
)
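
Based on the response structure shown above, the embedding vector itself can be pulled out of the first data entry, for example:

# Extract the raw embedding vector from the first (and only) data entry
vector = response.data[0]["embedding"]
print(len(vector), vector[:5])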

Next Steps

This was a quick introduction to using Databricks model serving with LiteLLM. After reading this guide, you should be able to:

  • Call Databricks Model Serving models, and models from other providers, with LiteLLM
  • Set up the LiteLLM OpenAI Proxy Server and call models from different sources with the OpenAI Python Client
  • Use Completions, Chat, and Embedding models with LiteLLM

There is much more you can do with the combination of Databricks model serving and LiteLLM. Here are some ideas:

  • Serve custom chat models with Databricks model serving and add them to your LiteLLM OpenAI Proxy Server
  • Add models from other providers: in addition to Databricks and Anthropic models, you might want to try models from other providers, or even locally hosted models via Ollama. You can do so by adding them to your proxy config file or calling them directly from the LiteLLM Python client.
  • Configure load balancing, fallbacks, retries, and timeouts with the LiteLLM OpenAI proxy server. This is particularly useful if you expect a high volume of requests and want to ensure reliable performance.
  • Write a custom callback to log model calls/responses to MLflow.