Great question -- this is a nuanced topic because there are two layers involved: Databricks' proxy layer and OpenAI's caching mechanism.
Short answer: No, ai_query does not currently support OpenAI's prompt caching.
1. ai_query doesn't expose token usage metadata
ai_query is a SQL function that returns only the model's text response -- it does **not** return the full response object, including usage.prompt_tokens_details.cached_tokens. So even if caching were happening behind the scenes, you'd have no way to verify it from the ai_query output.
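For illustration, here's a minimal sketch of the ai_query call shape from a notebook (the endpoint name databricks-gpt-4o and the prompt are placeholders); the result is a single text column with no usage object attached:

```python
# Minimal sketch -- "databricks-gpt-4o" is a placeholder endpoint name;
# substitute whatever serving endpoint your workspace exposes.
df = spark.sql("""
    SELECT ai_query(
        'databricks-gpt-4o',
        'Summarize the following document: ...'
    ) AS response
""")

# One string column; no usage.prompt_tokens_details anywhere in the result.
df.show(truncate=False)
```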
2. Databricks Foundation Model APIs act as a proxy
When you call an OpenAI model through Databricks (whether via ai_query, the REST API, or the OpenAI SDK pointed at a Databricks serving endpoint), your request goes through Databricks' infrastructure, not directly to OpenAI.
OpenAI's automatic prompt caching works by:
- Routing requests to a specific machine based on a hash of the prompt prefix
- Caching prompts with 1024+ tokens
- Scoping the cache to the organization making the API call
Since Databricks is the one making the call to OpenAI (not you directly), the caching behavior is governed by how Databricks routes and batches these requests on their infrastructure. The cached_tokens = 0 result confirms that caching is not occurring through this path.
3. What about the OpenAI SDK test?
When you use the OpenAI SDK with identical model and settings but pointed at a Databricks serving endpoint (e.g., base_url = "https://workspace.databricks.com/serving-endpoints"), you're still going through Databricks' proxy -- not hitting OpenAI directly. That's why cached_tokens = 0.
If you point the OpenAI SDK directly at https://api.openai.com with your own OpenAI API key and repeat the test, you will see caching kick in (assuming 1024+ tokens and the same prompt prefix).
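To reproduce the proxied case yourself, a sketch like the following works (workspace URL, token, and endpoint name are placeholders; through the proxy, prompt_tokens_details may be zeroed or absent entirely):

```python
from openai import OpenAI

# Placeholders -- substitute your workspace URL, a Databricks token,
# and the serving endpoint name your workspace exposes.
client = OpenAI(
    api_key="<databricks-token>",
    base_url="https://<workspace>.databricks.com/serving-endpoints",
)

resp = client.chat.completions.create(
    model="databricks-gpt-4o",
    messages=[{"role": "user", "content": "<your 1024+ token prompt>"}],
)

# Through the Databricks proxy this reports 0, or the field is missing.
details = resp.usage.prompt_tokens_details
print(details.cached_tokens if details else "no prompt_tokens_details returned")
```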
Alternatives
Option A: Call OpenAI directly
If prompt caching savings are significant for your workload, bypass Databricks' Foundation Model APIs and call OpenAI's API directly using a Python UDF or notebook:
```python
import openai

# Direct to OpenAI -- bypasses Databricks, so automatic prompt caching applies
client = openai.OpenAI(api_key="<your-openai-key>")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "<your 1024+ token prompt>"}],
)

print(response.usage.prompt_tokens_details.cached_tokens)  # nonzero on a cache hit
```
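One caveat: the first request only writes the cache, so expect cached_tokens = 0 on the initial call. Repeat the same prompt prefix shortly afterward and the counter should turn nonzero -- OpenAI reports cache hits in 128-token increments starting at 1,024.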
Option B: Use Databricks-hosted Claude with explicit caching
Databricks does support prompt caching for Claude models via the cache_control parameter in the Foundation Model API:
```python
import requests

# Placeholders -- substitute your workspace URL and a Databricks token.
db_host = "https://<workspace>.databricks.com"
token = "<databricks-token>"

response = requests.post(
    f"{db_host}/serving-endpoints/databricks-claude-sonnet-4/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "messages": [{
            "role": "user",
            "content": [
                # Mark the long, reused prefix as cacheable
                {"type": "text", "text": "<long context>", "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": "Your question"},
            ],
        }]
    },
)
print(response.json())
```
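On the Anthropic side, cache effectiveness shows up in the response's usage block rather than in cached_tokens -- look for fields along the lines of cache_creation_input_tokens (first call, cache write) and cache_read_input_tokens (subsequent hits). Verify the exact field names against your endpoint's response, since they follow Anthropic's schema rather than OpenAI's.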
Option C: Use an external model endpoint with AI Gateway
Register your own OpenAI API key as an external model endpoint, which routes calls through Databricks' AI Gateway but directly to OpenAI. This may preserve caching behavior, though it isn't guaranteed and depends on how requests are routed.
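A sketch of registering such an endpoint with the MLflow Deployments client is below. The endpoint name, secret scope, and key name are placeholders, and this assumes your OpenAI key is stored as a Databricks secret:

```python
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

# Placeholders: endpoint name, plus the secret scope/key holding your OpenAI key.
client.create_endpoint(
    name="my-openai-gpt-4o",
    config={
        "served_entities": [{
            "name": "gpt-4o",
            "external_model": {
                "name": "gpt-4o",
                "provider": "openai",
                "task": "llm/v1/chat",
                "openai_config": {
                    "openai_api_key": "{{secrets/my-scope/openai-key}}",
                },
            },
        }]
    },
)

# Queries to this endpoint now route through AI Gateway to api.openai.com.
```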
Summary
| Path | Caching Works? | Why |
|------|----------------|-----|
| ai_query via Databricks FMAPI | No | Proxied through Databricks; no usage metadata returned |
| OpenAI SDK via Databricks endpoint | No | Still proxied through Databricks |
| OpenAI SDK via api.openai.com directly | Yes | Direct connection; OpenAI handles routing + caching |
| Databricks FMAPI with Claude models | Yes | Explicit cache_control parameter supported |
Anuj Lathi
Solutions Engineer @ Databricks