ai_query and cached tokens

samuel86
New Contributor III

Is ai_query actually able to use OpenAI's cached tokens? I was unable to prove it. The response object from ai_query does not contain the raw response, and when I re-run an identical request via the OpenAI SDK (identical model, settings, etc.) and examine the response, cached_tokens = 0, which indicates that caching does not work in this setup, for whatever reason.

1 ACCEPTED SOLUTION


anuj_lathi
Databricks Employee

Great question -- this is a nuanced topic because there are two layers involved: Databricks' proxy layer and OpenAI's caching mechanism.

Short answer: No, ai_query does not currently support OpenAI's prompt caching.

1. ai_query doesn't expose token usage metadata

ai_query is a SQL function that returns only the model's text response -- it does not return the full response object, including usage.prompt_tokens_details.cached_tokens. So even if caching were happening behind the scenes, you'd have no way to verify it from the ai_query output.
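
For illustration, a minimal sketch of an ai_query call in a notebook (the endpoint name is a placeholder for whatever serving endpoint you're actually using):

# Sketch only: ai_query returns just the generated text, with no usage metadata.
# 'my-openai-endpoint' is a placeholder for your own serving endpoint name.
result = spark.sql("""
    SELECT ai_query(
        'my-openai-endpoint',
        'Summarize prompt caching in one sentence.'
    ) AS response
""")
result.show(truncate=False)  # a single string column; no prompt_tokens_details.cached_tokens anywhere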

2. Databricks Foundation Model APIs act as a proxy

When you call an OpenAI model through Databricks (whether via ai_query, the REST API, or the OpenAI SDK pointed at a Databricks serving endpoint), your request goes through Databricks' infrastructure, not directly to OpenAI.

OpenAI's automatic prompt caching works by:

  • Routing requests to a specific machine based on a hash of the prompt prefix
  • Caching prompts with 1024+ tokens
  • Scoping the cache to the organization making the API call

Since Databricks is the one making the call to OpenAI (not you directly), the caching behavior is governed by how Databricks routes and batches these requests on their infrastructure. The cached_tokens = 0 result confirms that caching is not occurring through this path.
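
To make the prefix-hashing point above concrete, here is a small sketch (prompt contents are placeholders) of how requests are usually structured so a cache hit is even possible: the large, unchanging block goes first and only the trailing question varies between calls.

# Sketch: OpenAI only reuses cached tokens when the leading 1024+ tokens are
# identical across requests, so keep the stable block first.
STABLE_CONTEXT = "<large, unchanging system/context block, 1024+ tokens>"

def build_messages(question: str):
    return [
        {"role": "system", "content": STABLE_CONTEXT},   # identical prefix on every call
        {"role": "user", "content": question},           # only this part changes
    ]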

3. What about the OpenAI SDK test?

When you use the OpenAI SDK with identical model and settings but pointed at a Databricks serving endpoint (e.g., base_url = "https://<your-workspace>.databricks.com/serving-endpoints"), you're still going through Databricks' proxy -- not hitting OpenAI directly. That's why cached_tokens = 0.
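
For context, that proxied setup looks roughly like this (workspace URL, endpoint name, and token are all placeholders):

from openai import OpenAI

# Sketch: the OpenAI SDK pointed at a Databricks serving endpoint.
# Traffic goes through Databricks' proxy, not straight to api.openai.com.
client = OpenAI(
    api_key="<databricks-personal-access-token>",
    base_url="https://<your-workspace>.databricks.com/serving-endpoints",
)

response = client.chat.completions.create(
    model="<your-endpoint-name>",  # the serving endpoint name, not the raw OpenAI model id
    messages=[{"role": "user", "content": "<your 1024+ token prompt>"}],
)
print(response.usage)  # no cached-token savings show up on this path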

If you point the OpenAI SDK directly at https://api.openai.com with your own OpenAI API key and repeat the test, you will see caching kick in (assuming 1024+ tokens and the same prompt prefix).

Alternatives

Option A: Call OpenAI directly

If prompt caching savings are significant for your workload, bypass Databricks' Foundation Model APIs and call OpenAI's API directly using a Python UDF or notebook:

import openai

client = openai.OpenAI(api_key="<your-openai-key>")  # Direct to OpenAI

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "<your 1024+ token prompt>"}]
)

# 0 on the first call; > 0 when you repeat a request with the same 1024+ token prefix
print(response.usage.prompt_tokens_details.cached_tokens)
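
And if you want to apply the same direct call across a DataFrame column (the "Python UDF" route mentioned above), one possible shape is a pandas UDF -- a sketch only, with the key and column name as placeholders:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def ask_openai(prompts: pd.Series) -> pd.Series:
    # Sketch: one client per batch; in real use, pull the key from a Databricks secret.
    import openai
    client = openai.OpenAI(api_key="<your-openai-key>")
    answers = []
    for prompt in prompts:
        r = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        answers.append(r.choices[0].message.content)
    return pd.Series(answers)

# Usage: df.withColumn("answer", ask_openai("prompt_column"))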

 

Option B: Use Databricks-hosted Claude with explicit caching

Databricks does support prompt caching for Claude models via the cache_control parameter in the Foundation Model API:

import requests

# db_host is your workspace URL and token a Databricks personal access token (placeholders).
response = requests.post(
    f"{db_host}/serving-endpoints/databricks-claude-sonnet-4/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "messages": [{
            "role": "user",
            "content": [
                # Mark the long, stable prefix for caching; keep the variable question last.
                {"type": "text", "text": "<long context>", "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": "Your question"}
            ]
        }]
    }
)

 

Option C: Use an external model endpoint with AI Gateway

Register your own OpenAI API key as an external model endpoint. Calls are still governed by Databricks' AI Gateway, but they are sent to OpenAI under your own key. This may preserve caching behavior (though it's not guaranteed, depending on routing).
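
A sketch of registering such an endpoint with the MLflow deployments client (names and the secret path are placeholders; check the current external models docs for the exact config fields):

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

# Sketch: an external model endpoint that forwards chat requests to OpenAI
# using your own API key stored as a Databricks secret.
client.create_endpoint(
    name="my-openai-gpt4o",                      # placeholder endpoint name
    config={
        "served_entities": [{
            "name": "gpt-4o-entity",
            "external_model": {
                "name": "gpt-4o",
                "provider": "openai",
                "task": "llm/v1/chat",
                "openai_config": {
                    "openai_api_key": "{{secrets/<scope>/<openai-key>}}",
                },
            },
        }],
    },
)

# Then call ai_query('my-openai-gpt4o', ...) or point the OpenAI SDK at this endpoint.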

Summary

| Path | Caching Works? | Why |
| --- | --- | --- |
| ai_query via Databricks FMAPI | No | Proxied through Databricks; no usage metadata returned |
| OpenAI SDK via Databricks endpoint | No | Still proxied through Databricks |
| OpenAI SDK via api.openai.com directly | Yes | Direct connection; OpenAI handles routing + caching |
| Databricks FMAPI with Claude models | Yes | Explicit cache_control parameter supported |


Anuj Lathi
Solutions Engineer @ Databricks
