<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: ai_query and cached tokens in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/ai-query-and-cached-tokens/m-p/154020#M1744</link>
    <description>&lt;P&gt;&lt;SPAN&gt;Great question -- this is a nuanced topic because there are two layers involved: &lt;/SPAN&gt;&lt;STRONG&gt;Databricks' proxy layer&lt;/STRONG&gt;&lt;SPAN&gt; and &lt;/SPAN&gt;&lt;STRONG&gt;OpenAI's caching mechanism&lt;/STRONG&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Short answer: No, ai_query does not currently support OpenAI's prompt caching.&lt;/STRONG&gt;&lt;/H3&gt;
&lt;H3&gt;&lt;STRONG&gt;1. ai_query doesn't expose token usage metadata&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;ai_query is a SQL function that returns only the model's text response -- it does &lt;/SPAN&gt;&lt;STRONG&gt;not&lt;/STRONG&gt;&lt;SPAN&gt; return the full response object, including usage.prompt_tokens_details.cached_tokens. So even if caching were happening behind the scenes, you'd have no way to verify it from the ai_query output.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;2. Databricks Foundation Model APIs act as a proxy&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;When you call an OpenAI model through Databricks (whether via ai_query, the REST API, or the OpenAI SDK pointed at a Databricks serving endpoint), your request goes through &lt;/SPAN&gt;&lt;STRONG&gt;Databricks' infrastructure&lt;/STRONG&gt;&lt;SPAN&gt;, not directly to OpenAI.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;OpenAI's automatic prompt caching works by:&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;SPAN&gt;Routing requests to a specific machine based on a &lt;/SPAN&gt;&lt;STRONG&gt;hash of the prompt prefix&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;SPAN&gt;Caching prompts with &lt;/SPAN&gt;&lt;STRONG&gt;1024+ tokens&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;SPAN&gt;Scoping caches to the &lt;/SPAN&gt;&lt;STRONG&gt;organization&lt;/STRONG&gt;&lt;SPAN&gt; making the API call&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
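&lt;P&gt;&lt;SPAN&gt;As a rough mental model (illustrative only -- the hashing details below are assumptions, not OpenAI's actual implementation), the routing behavior above can be sketched like this:&lt;/SPAN&gt;&lt;/P&gt;

```python
# Illustrative sketch only -- NOT OpenAI's real routing code. It models the
# documented behavior: requests are routed by a hash of the prompt prefix,
# and only prompts of 1024+ tokens are eligible for caching.
import hashlib

MIN_CACHEABLE_TOKENS = 1024  # OpenAI's documented minimum for caching

def cache_routing_key(prompt_tokens):
    """Return a routing key for the cacheable prefix, or None if too short."""
    if len(prompt_tokens) < MIN_CACHEABLE_TOKENS:
        return None
    prefix = prompt_tokens[:MIN_CACHEABLE_TOKENS]
    return hashlib.sha256(" ".join(prefix).encode("utf-8")).hexdigest()

short = [f"tok{i}" for i in range(500)]
long_a = [f"tok{i}" for i in range(1500)]
long_b = long_a[:1024] + ["different", "suffix"]

print(cache_routing_key(short))                                # None: under 1024 tokens
print(cache_routing_key(long_a) == cache_routing_key(long_b))  # True: shared prefix
```

&lt;P&gt;&lt;SPAN&gt;The key point: anything that changes the prompt prefix -- or that makes the request arrive from a different organization, as with a proxy -- produces a different cache key.&lt;/SPAN&gt;&lt;/P&gt;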
&lt;P&gt;&lt;SPAN&gt;Since Databricks is the one making the call to OpenAI (not you directly), the caching behavior is governed by how Databricks routes and batches these requests on their infrastructure. The cached_tokens = 0 result confirms that caching is &lt;/SPAN&gt;&lt;STRONG&gt;not&lt;/STRONG&gt;&lt;SPAN&gt; occurring through this path.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;3. What about the OpenAI SDK test?&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;When you use the OpenAI SDK with identical model and settings but pointed at a &lt;/SPAN&gt;&lt;STRONG&gt;Databricks serving endpoint&lt;/STRONG&gt;&lt;SPAN&gt; (e.g., base_url = "&lt;A href="https://workspace.databricks.com/serving-endpoints" target="_blank"&gt;https://workspace.databricks.com/serving-endpoints&lt;/A&gt;"), you're still going through Databricks' proxy -- not hitting OpenAI directly. That's why cached_tokens = 0.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;If you point the OpenAI SDK directly at &lt;A href="https://api.openai.com" target="_blank"&gt;https://api.openai.com&lt;/A&gt; with your own OpenAI API key and repeat the test, you &lt;/SPAN&gt;&lt;STRONG&gt;will&lt;/STRONG&gt;&lt;SPAN&gt; see caching kick in (assuming 1024+ tokens and the same prompt prefix).&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Alternatives&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;Option A: Call OpenAI directly&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;If prompt caching savings are significant for your workload, bypass Databricks' Foundation Model APIs and call OpenAI's API directly using a Python UDF or notebook:&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;import openai

client = openai.OpenAI(api_key="&amp;lt;your-openai-key&amp;gt;")  # Direct to OpenAI, bypassing Databricks

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "&amp;lt;your 1024+ token prompt&amp;gt;"}],
)

print(response.usage.prompt_tokens_details.cached_tokens)  # Nonzero on a repeated identical call indicates cache hits&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Option B: Use Databricks-hosted Claude with explicit caching&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Databricks &lt;/SPAN&gt;&lt;STRONG&gt;does&lt;/STRONG&gt;&lt;SPAN&gt; support prompt caching for Claude models via the cache_control parameter in the Foundation Model API:&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;import requests

response = requests.post(
    f"{db_host}/serving-endpoints/databricks-claude-sonnet-4/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "&amp;lt;long context&amp;gt;", "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": "Your question"}
            ]
        }]
    },
)&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Option C: Use an external model endpoint with AI Gateway&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Register your own OpenAI API key as an external model endpoint, which routes calls through Databricks' AI Gateway but directly to OpenAI. This may preserve caching behavior (though it's not guaranteed depending on routing).&lt;/SPAN&gt;&lt;/P&gt;
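&lt;P&gt;&lt;SPAN&gt;For illustration, a hedged sketch of what the endpoint registration payload might look like -- the endpoint name and secret path are placeholders, so verify the exact fields against the Databricks external models documentation:&lt;/SPAN&gt;&lt;/P&gt;

```python
# Hedged sketch: the JSON body for registering an external model endpoint via
# the Databricks serving-endpoints REST API (POST /api/2.0/serving-endpoints).
# "openai-gpt4o-external" and the secret scope/key are hypothetical placeholders.
payload = {
    "name": "openai-gpt4o-external",  # hypothetical endpoint name
    "config": {
        "served_entities": [
            {
                "external_model": {
                    "name": "gpt-4o",
                    "provider": "openai",
                    "task": "llm/v1/chat",
                    "openai_config": {
                        # reference a Databricks secret rather than a raw key
                        "openai_api_key": "{{secrets/my_scope/openai_api_key}}"
                    },
                }
            }
        ]
    },
}

print(payload["config"]["served_entities"][0]["external_model"]["provider"])  # openai
```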
&lt;H3&gt;&lt;STRONG&gt;Summary&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;TABLE&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Path&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Caching Works?&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Why&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;ai_query via Databricks FMAPI&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;No&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Proxied through Databricks; no usage metadata returned&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;OpenAI SDK via Databricks endpoint&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;No&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Still proxied through Databricks&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;OpenAI SDK via api.openai.com directly&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Yes&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Direct connection, OpenAI handles routing + caching&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Databricks FMAPI with Claude models&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Yes&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Explicit cache_control parameter supported&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;H3&gt;&lt;STRONG&gt;References&lt;/STRONG&gt;&lt;/H3&gt;
&lt;UL&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://developers.openai.com/api/docs/guides/prompt-caching" target="_blank"&gt;&lt;SPAN&gt;OpenAI Prompt Caching Guide&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://docs.databricks.com/aws/en/machine-learning/foundation-model-apis/" target="_blank"&gt;&lt;SPAN&gt;Databricks Foundation Model APIs&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://docs.databricks.com/aws/en/machine-learning/model-serving/score-foundation-models" target="_blank"&gt;&lt;SPAN&gt;Databricks -- Use Foundation Models&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://www.ddhigh.com/en/2026/03/26/fix-opencode-prompt-caching-with-third-party-proxy/" target="_blank"&gt;&lt;SPAN&gt;Fix Prompt Cache Misses with Third-Party Proxy&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
    <pubDate>Fri, 10 Apr 2026 04:31:17 GMT</pubDate>
    <dc:creator>anuj_lathi</dc:creator>
    <dc:date>2026-04-10T04:31:17Z</dc:date>
    <item>
      <title>ai_query and cached tokens</title>
      <link>https://community.databricks.com/t5/generative-ai/ai-query-and-cached-tokens/m-p/153984#M1743</link>
      <description>&lt;P&gt;Is ai_query actually able to use OpenAI's cached tokens? I was not unable to prove it. The response object from ai_query does not contain the raw response, and when I re-run an identical request via OpenAI SDK (identical model, settings etc.) and examine the response, cached_tokens = 0, which indicates that caching doe snot work in this setup, for whatever reason.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Apr 2026 19:30:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/ai-query-and-cached-tokens/m-p/153984#M1743</guid>
      <dc:creator>samuel86</dc:creator>
      <dc:date>2026-04-09T19:30:11Z</dc:date>
    </item>
    <item>
      <title>Re: ai_query and cached tokens</title>
      <link>https://community.databricks.com/t5/generative-ai/ai-query-and-cached-tokens/m-p/154020#M1744</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Great question -- this is a nuanced topic because there are two layers involved: &lt;/SPAN&gt;&lt;STRONG&gt;Databricks' proxy layer&lt;/STRONG&gt;&lt;SPAN&gt; and &lt;/SPAN&gt;&lt;STRONG&gt;OpenAI's caching mechanism&lt;/STRONG&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Short answer: No, ai_query does not currently support OpenAI's prompt caching.&lt;/STRONG&gt;&lt;/H3&gt;
&lt;H3&gt;&lt;STRONG&gt;1. ai_query doesn't expose token usage metadata&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;ai_query is a SQL function that returns only the model's text response -- it does &lt;/SPAN&gt;&lt;STRONG&gt;not&lt;/STRONG&gt;&lt;SPAN&gt; return the full response object, including usage.prompt_tokens_details.cached_tokens. So even if caching were happening behind the scenes, you'd have no way to verify it from the ai_query output.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;2. Databricks Foundation Model APIs act as a proxy&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;When you call an OpenAI model through Databricks (whether via ai_query, the REST API, or the OpenAI SDK pointed at a Databricks serving endpoint), your request goes through &lt;/SPAN&gt;&lt;STRONG&gt;Databricks' infrastructure&lt;/STRONG&gt;&lt;SPAN&gt;, not directly to OpenAI.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;OpenAI's automatic prompt caching works by:&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;SPAN&gt;Routing requests to a specific machine based on a &lt;/SPAN&gt;&lt;STRONG&gt;hash of the prompt prefix&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;SPAN&gt;Caching prompts with &lt;/SPAN&gt;&lt;STRONG&gt;1024+ tokens&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;SPAN&gt;Scoping caches to the &lt;/SPAN&gt;&lt;STRONG&gt;organization&lt;/STRONG&gt;&lt;SPAN&gt; making the API call&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
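&lt;P&gt;&lt;SPAN&gt;As a rough mental model (illustrative only -- the hashing details below are assumptions, not OpenAI's actual implementation), the routing behavior above can be sketched like this:&lt;/SPAN&gt;&lt;/P&gt;

```python
# Illustrative sketch only -- NOT OpenAI's real routing code. It models the
# documented behavior: requests are routed by a hash of the prompt prefix,
# and only prompts of 1024+ tokens are eligible for caching.
import hashlib

MIN_CACHEABLE_TOKENS = 1024  # OpenAI's documented minimum for caching

def cache_routing_key(prompt_tokens):
    """Return a routing key for the cacheable prefix, or None if too short."""
    if len(prompt_tokens) < MIN_CACHEABLE_TOKENS:
        return None
    prefix = prompt_tokens[:MIN_CACHEABLE_TOKENS]
    return hashlib.sha256(" ".join(prefix).encode("utf-8")).hexdigest()

short = [f"tok{i}" for i in range(500)]
long_a = [f"tok{i}" for i in range(1500)]
long_b = long_a[:1024] + ["different", "suffix"]

print(cache_routing_key(short))                                # None: under 1024 tokens
print(cache_routing_key(long_a) == cache_routing_key(long_b))  # True: shared prefix
```

&lt;P&gt;&lt;SPAN&gt;The key point: anything that changes the prompt prefix -- or that makes the request arrive from a different organization, as with a proxy -- produces a different cache key.&lt;/SPAN&gt;&lt;/P&gt;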
&lt;P&gt;&lt;SPAN&gt;Since Databricks is the one making the call to OpenAI (not you directly), the caching behavior is governed by how Databricks routes and batches these requests on their infrastructure. The cached_tokens = 0 result confirms that caching is &lt;/SPAN&gt;&lt;STRONG&gt;not&lt;/STRONG&gt;&lt;SPAN&gt; occurring through this path.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;3. What about the OpenAI SDK test?&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;When you use the OpenAI SDK with identical model and settings but pointed at a &lt;/SPAN&gt;&lt;STRONG&gt;Databricks serving endpoint&lt;/STRONG&gt;&lt;SPAN&gt; (e.g., base_url = "&lt;A href="https://workspace.databricks.com/serving-endpoints" target="_blank"&gt;https://workspace.databricks.com/serving-endpoints&lt;/A&gt;"), you're still going through Databricks' proxy -- not hitting OpenAI directly. That's why cached_tokens = 0.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;If you point the OpenAI SDK directly at &lt;A href="https://api.openai.com" target="_blank"&gt;https://api.openai.com&lt;/A&gt; with your own OpenAI API key and repeat the test, you &lt;/SPAN&gt;&lt;STRONG&gt;will&lt;/STRONG&gt;&lt;SPAN&gt; see caching kick in (assuming 1024+ tokens and the same prompt prefix).&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Alternatives&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;Option A: Call OpenAI directly&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;If prompt caching savings are significant for your workload, bypass Databricks' Foundation Model APIs and call OpenAI's API directly using a Python UDF or notebook:&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;import openai

client = openai.OpenAI(api_key="&amp;lt;your-openai-key&amp;gt;")  # Direct to OpenAI, bypassing Databricks

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "&amp;lt;your 1024+ token prompt&amp;gt;"}],
)

print(response.usage.prompt_tokens_details.cached_tokens)  # Nonzero on a repeated identical call indicates cache hits&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Option B: Use Databricks-hosted Claude with explicit caching&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Databricks &lt;/SPAN&gt;&lt;STRONG&gt;does&lt;/STRONG&gt;&lt;SPAN&gt; support prompt caching for Claude models via the cache_control parameter in the Foundation Model API:&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;import requests

response = requests.post(
    f"{db_host}/serving-endpoints/databricks-claude-sonnet-4/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "&amp;lt;long context&amp;gt;", "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": "Your question"}
            ]
        }]
    },
)&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Option C: Use an external model endpoint with AI Gateway&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Register your own OpenAI API key as an external model endpoint, which routes calls through Databricks' AI Gateway but directly to OpenAI. This may preserve caching behavior (though it's not guaranteed depending on routing).&lt;/SPAN&gt;&lt;/P&gt;
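&lt;P&gt;&lt;SPAN&gt;For illustration, a hedged sketch of what the endpoint registration payload might look like -- the endpoint name and secret path are placeholders, so verify the exact fields against the Databricks external models documentation:&lt;/SPAN&gt;&lt;/P&gt;

```python
# Hedged sketch: the JSON body for registering an external model endpoint via
# the Databricks serving-endpoints REST API (POST /api/2.0/serving-endpoints).
# "openai-gpt4o-external" and the secret scope/key are hypothetical placeholders.
payload = {
    "name": "openai-gpt4o-external",  # hypothetical endpoint name
    "config": {
        "served_entities": [
            {
                "external_model": {
                    "name": "gpt-4o",
                    "provider": "openai",
                    "task": "llm/v1/chat",
                    "openai_config": {
                        # reference a Databricks secret rather than a raw key
                        "openai_api_key": "{{secrets/my_scope/openai_api_key}}"
                    },
                }
            }
        ]
    },
}

print(payload["config"]["served_entities"][0]["external_model"]["provider"])  # openai
```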
&lt;H3&gt;&lt;STRONG&gt;Summary&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;TABLE&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Path&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Caching Works?&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Why&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;ai_query via Databricks FMAPI&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;No&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Proxied through Databricks; no usage metadata returned&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;OpenAI SDK via Databricks endpoint&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;No&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Still proxied through Databricks&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;OpenAI SDK via api.openai.com directly&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Yes&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Direct connection, OpenAI handles routing + caching&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Databricks FMAPI with Claude models&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Yes&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Explicit cache_control parameter supported&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;H3&gt;&lt;STRONG&gt;References&lt;/STRONG&gt;&lt;/H3&gt;
&lt;UL&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://developers.openai.com/api/docs/guides/prompt-caching" target="_blank"&gt;&lt;SPAN&gt;OpenAI Prompt Caching Guide&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://docs.databricks.com/aws/en/machine-learning/foundation-model-apis/" target="_blank"&gt;&lt;SPAN&gt;Databricks Foundation Model APIs&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://docs.databricks.com/aws/en/machine-learning/model-serving/score-foundation-models" target="_blank"&gt;&lt;SPAN&gt;Databricks -- Use Foundation Models&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://www.ddhigh.com/en/2026/03/26/fix-opencode-prompt-caching-with-third-party-proxy/" target="_blank"&gt;&lt;SPAN&gt;Fix Prompt Cache Misses with Third-Party Proxy&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Fri, 10 Apr 2026 04:31:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/ai-query-and-cached-tokens/m-p/154020#M1744</guid>
      <dc:creator>anuj_lathi</dc:creator>
      <dc:date>2026-04-10T04:31:17Z</dc:date>
    </item>
  </channel>
</rss>

