How to implement prompt caching using Claude models?
08-26-2025 01:00 AM
Hi,
I am trying to use the prompt caching feature with the Claude "databricks-claude-sonnet-4" Databricks endpoint (wrapped in a ChatDatabricks instance). Using LangChain, I set
SystemMessage(
    content=[
        {
            "text": cached_doc_prompt,
            "type": "text",
            "cache_control": {"type": "ephemeral"},
        }
    ]
),
for the part of the message I want to cache.
I get this error:
Response text: {"error_code":"BAD_REQUEST","message":"BAD_REQUEST: Databricks does not support prompt cache for the first-party Anthropic model."}
How can prompt caching be achieved in Databricks?
Thanks for your help!
08-26-2025 01:33 PM
Hi @DinoSaluzzi,
Anthropic’s prompt caching is not supported via Databricks endpoints, so the 400 error is expected. You have two options:
- To use real prompt caching, call Anthropic’s API directly.
- To stay within Databricks, adopt an alternative such as a pseudo-cache, RAG, or context compression.
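To make the pseudo-cache option concrete, here is a minimal sketch, assuming it means an application-level cache that reuses a stored answer when the same context and question are sent again. The `ResponseCache` class and the `fake_model` stand-in are hypothetical names for illustration, not part of any Databricks or LangChain API; in practice `call_model` would wrap your own endpoint call. Unlike real prompt caching, this only helps with exact repeats and does not reduce tokens for new questions.

```python
import hashlib
from typing import Callable, Dict


class ResponseCache:
    """Application-level cache: identical (context, question) pairs skip the model call."""

    def __init__(self, call_model: Callable[[str, str], str]) -> None:
        self._call_model = call_model  # hypothetical wrapper around your endpoint call
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(context: str, question: str) -> str:
        # Hash both parts with a separator so ("ab", "c") and ("a", "bc") get different keys.
        digest = hashlib.sha256()
        digest.update(context.encode("utf-8"))
        digest.update(b"\x00")
        digest.update(question.encode("utf-8"))
        return digest.hexdigest()

    def ask(self, context: str, question: str) -> str:
        key = self._key(context, question)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = self._call_model(context, question)
        self._store[key] = answer
        return answer


def fake_model(context: str, question: str) -> str:
    # Stand-in for a real endpoint call so the sketch runs offline.
    return f"answer to {question!r} ({len(context)} context chars)"


cache = ResponseCache(fake_model)
first = cache.ask("long document text", "What is Spark?")
second = cache.ask("long document text", "What is Spark?")  # served from the cache
print(cache.hits, cache.misses)
```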
Overview with RAG:
Instead of sending the entire context with every query:
- Ingest the content into Databricks Vector Search (or another vector store).
- For each user question, retrieve only the most relevant chunks (top-k) and attach them to the prompt.
Pros: dynamically reduces tokens, scales well with large documents.
Cons: requires an embeddings + retrieval pipeline, plus tuning of chunking/top-k.
High-level skeleton:
from langchain_core.messages import HumanMessage, SystemMessage

# 1) Index documents (embeddings) → Vector Search / FAISS
# 2) For each question (assumes `retriever` and `llm` are already configured):
query = "User question"
# top-k chunks; newer LangChain versions prefer retriever.invoke(query)
contexts = retriever.get_relevant_documents(query)
context_text = "\n\n".join(d.page_content for d in contexts)
msgs = [
    SystemMessage(content="Answer using only the provided context."),
    HumanMessage(content=f"Context:\n{context_text}\n\nQ: {query}"),
]
resp = llm.invoke(msgs)
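To show what the top-k retrieval step does without any vector store, here is a self-contained toy version. It is a sketch under loud assumptions: it swaps real embeddings for a bag-of-words cosine similarity, and `tokenize`, `cosine`, and `top_k` are hypothetical helper names, not Databricks Vector Search or LangChain functions. The sample chunks are made up for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> Counter:
    # Lowercased word counts stand in for an embedding vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(chunks: List[str], query: str, k: int = 2) -> List[str]:
    # Rank every chunk by similarity to the query and keep the k best.
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: cosine(tokenize(c), q), reverse=True)
    return ranked[:k]


chunks = [
    "Spark executes jobs in memory across a cluster.",
    "Delta Lake adds ACID transactions to data lakes.",
    "Prompt caching stores a prompt prefix for reuse.",
]
selected = top_k(chunks, "How does prompt caching work?", k=1)
context_text = "\n\n".join(selected)  # only this text is attached to the prompt
print(context_text)
```

Only the selected chunk is sent to the model, which is where the token savings come from; a real pipeline would replace the word-count similarity with an embedding model and an index.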
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa
08-29-2025 08:47 AM
Hi @WiliamRosa ,
Thanks for your response!
I still need prompt caching, as my use case requires it.
Other Databricks endpoints seem to work, such as 'databricks-gpt-oss-120b' (using the same logic you shared in your message). However, I could not confirm that the cache is actually being used, because I cannot access token usage for these queries.
Best regards!
09-08-2025 11:42 PM
Hi @WiliamRosa
Good day.
An Azure Databricks customer of ours asked the same question. Is there a roadmap to support this in Serving Endpoints?
The customer mentioned that AWS Bedrock appears to support the prompt caching feature.
11-16-2025 08:25 PM
@DinoSaluzzi If you restructure your system and user prompts along the lines of the example below, prompt caching should start working as expected.
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a helpful Apache Spark expert. Always provide concise, technical answers.",
                "cache_control": {"type": "ephemeral"}
            }
        ]
    },
    {
        "role": "user",
        "content": "What are the top 3 benefits of using Apache Spark?"
    }
]
I can confirm that both cache_read_input_tokens and cache_creation_input_tokens are updating correctly, which indicates that caching is being applied.
Please note that prompt caching does not activate for small prompts; it is typically triggered only once the prompt size crosses a certain threshold.
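One way to check whether a given call used the cache is to look at the two usage counters mentioned above. The sketch below assumes the response exposes them in a plain dict; how you obtain that dict from your client, and the `cache_status` helper itself, are assumptions for illustration, and the sample numbers are made up. For reference, Anthropic's own documentation lists a 1,024-token minimum cacheable prompt for Claude Sonnet 4; whether the Databricks endpoint applies the same threshold is not confirmed here.

```python
from typing import Mapping, Optional


def cache_status(usage: Mapping[str, Optional[int]]) -> str:
    """Classify one response from its usage counters (field names from the post above)."""
    created = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    if read > 0:
        return "hit"        # part of the prompt was served from the cache
    if created > 0:
        return "written"    # the prefix was cached on this call for later reuse
    return "not cached"     # e.g. the prompt was below the minimum size


# Illustrative usage dicts (hypothetical values, not real endpoint output).
first_call = {"prompt_tokens": 1500, "cache_creation_input_tokens": 1400, "cache_read_input_tokens": 0}
second_call = {"prompt_tokens": 1500, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 1400}
short_prompt = {"prompt_tokens": 40}

print(cache_status(first_call), cache_status(second_call), cache_status(short_prompt))
```

If every call reports "not cached" even with `cache_control` set, the prompt size is the first thing to check.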