
How to implement prompt caching using Claude models?

DinoSaluzzi
New Contributor III

Hi,

I am trying to use the prompt caching feature with the "databricks-claude-sonnet-4" Databricks endpoint (wrapped in a ChatDatabricks instance). Using LangChain, I set

SystemMessage(
    content=[
        {
            "type": "text",
            "text": cached_doc_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ]
),

for the part of the message I want to cache.

I get this error:

Response text: {"error_code":"BAD_REQUEST","message":"BAD_REQUEST: Databricks does not support prompt cache for the first-party Anthropic model."}

How can prompt caching be achieved in Databricks?

Thanks for your help!

3 REPLIES

WiliamRosa
New Contributor III

Hi @DinoSaluzzi,

Anthropic’s prompt caching is not supported via Databricks foundation model endpoints, so the 400 error is expected.

To use real prompt caching → call Anthropic’s API directly (rough sketch below).
To stay within Databricks → adopt alternatives such as a pseudo-cache, RAG, or context compression.
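
If you go the direct-Anthropic route, the cache_control block you already wrote maps almost one-to-one onto Anthropic's Messages API. A minimal sketch (the model identifier below is an assumption; use whichever Sonnet 4 model your account has access to):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model identifier
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": cached_doc_prompt,               # the large, reusable context
            "cache_control": {"type": "ephemeral"},  # mark this prefix as cacheable
        }
    ],
    messages=[{"role": "user", "content": "User question"}],
)
# On cache hits, usage reports cache_creation_input_tokens / cache_read_input_tokens
print(response.usage)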

Overview with RAG:
Instead of sending the entire context with every query:

- Ingest the content into Databricks Vector Search (or another vector store).
- For each user question, retrieve only the most relevant chunks (top-k) and attach them to the prompt.

Pros: dynamically reduces tokens, scales well with large documents.
Cons: requires an embeddings + retrieval pipeline, plus tuning of chunking/top-k.

High-level skeleton:

# 1) Index documents (embeddings) → Databricks Vector Search / FAISS (retriever sketch below)
# 2) For each question, retrieve the top-k chunks and build the prompt
from langchain_core.messages import SystemMessage, HumanMessage

query = "User question"
contexts = retriever.get_relevant_documents(query)  # top-k relevant chunks
context_text = "\n\n".join(d.page_content for d in contexts)

msgs = [
    SystemMessage(content="Answer using only the provided context."),
    HumanMessage(content=f"Context:\n{context_text}\n\nQ: {query}"),
]
resp = llm.invoke(msgs)  # llm: your ChatDatabricks instance
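
And a rough sketch of step 1, building the retriever over the large document. The embedding endpoint name, chunk sizes, and the choice of FAISS (rather than Databricks Vector Search) are assumptions to adapt, not a fixed recipe:

# Hypothetical step-1 sketch: chunk the large document and index it so that
# only the top-k chunks are sent with each question.
from langchain_community.embeddings import DatabricksEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(cached_doc_prompt)  # the large document you wanted to cache

embeddings = DatabricksEmbeddings(endpoint="databricks-bge-large-en")  # assumed endpoint name
vectorstore = FAISS.from_texts(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # tune top-k for your use case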
Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa

DinoSaluzzi
New Contributor III

Hi @WiliamRosa,

Thanks for your response! 
I will still need prompt caching, as my use case requires it.
Other Databricks endpoints seem to work, like 'databricks-gpt-oss-120b' (using the same logic you shared in your message), but I could not confirm actual caching because I cannot access token usage for these queries.
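
For reference, this is where token usage would normally surface on a recent langchain-core AIMessage, though I have not been able to confirm it for these endpoints:

resp = llm.invoke(msgs)
# These attributes exist on langchain-core's AIMessage; whether the serving
# endpoint actually populates them depends on the backend and client version.
print(resp.usage_metadata)     # e.g. {'input_tokens': ..., 'output_tokens': ..., 'total_tokens': ...}
print(resp.response_metadata)  # raw metadata returned by the endpoint, if any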

Best regards!

XianCao_98793
New Contributor II

Hi @WiliamRosa,
Good day.

We have an Azure Databricks customer who asked the same question and would like to know if there is a roadmap for making this work in a Serving Endpoint.
The customer mentioned that AWS Bedrock seems to support the prompt caching feature...
