
How to implement prompt caching using Claude models?

DinoSaluzzi
New Contributor III

Hi, 

I am trying to use the prompt caching feature with the Claude "databricks-claude-sonnet-4" Databricks endpoint (wrapped in a ChatDatabricks instance). Using LangChain, I set

SystemMessage(
    content=[
        {
            "type": "text",
            "text": cached_doc_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ]
),

for the part of the message I want to cache.

I get this error:

Response text: {"error_code":"BAD_REQUEST","message":"BAD_REQUEST: Databricks does not support prompt cache for the first-party Anthropic model."}

How can prompt caching be achieved in Databricks?

thanks for your help!


WiliamRosa
New Contributor II

Hi @DinoSaluzzi,

Anthropic’s prompt caching is not supported via Databricks endpoints, so the 400 error is expected.

To use real prompt caching → call Anthropic’s API directly (see the sketch below).
To stay within Databricks → adopt alternatives such as a pseudo-cache, RAG, or context compression.
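
A minimal sketch of the direct Anthropic route, assuming the official anthropic Python SDK, an ANTHROPIC_API_KEY environment variable, and placeholder model/variable names:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": cached_doc_prompt,               # the long, reusable context
            "cache_control": {"type": "ephemeral"},  # mark this block for caching
        }
    ],
    messages=[{"role": "user", "content": "User question"}],
)

# usage includes cache_creation_input_tokens / cache_read_input_tokens when caching applies
print(response.usage)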

Overview with RAG:
Instead of sending the entire context with every query:

- Ingest the content into Databricks Vector Search (or another vector store).
- For each user question, retrieve only the most relevant chunks (top-k) and attach them to the prompt.

Pros: dynamically reduces tokens, scales well with large documents.
Cons: requires an embeddings + retrieval pipeline, plus tuning of chunking/top-k.

High-level skeleton:

from langchain_core.messages import SystemMessage, HumanMessage

# 1) Index documents (embeddings) → Vector Search / FAISS
# 2) For each question, retrieve only the most relevant chunks (top-k) and build the prompt:
query = "User question"
contexts = retriever.get_relevant_documents(query)  # top-k chunks
context_text = "\n\n".join(d.page_content for d in contexts)

msgs = [
    SystemMessage(content="Answer using only the provided context."),
    HumanMessage(content=f"Context:\n{context_text}\n\nQ: {query}"),
]
resp = llm.invoke(msgs)
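
For completeness, one way the llm and retriever above could be wired up, assuming the databricks-langchain package and an existing Vector Search delta-sync index with managed embeddings (the index name is a placeholder):

from databricks_langchain import ChatDatabricks, DatabricksVectorSearch

# Chat model served by a Databricks endpoint
llm = ChatDatabricks(endpoint="databricks-claude-sonnet-4")

# Placeholder index name; assumes a delta-sync index with Databricks-managed embeddings
vs = DatabricksVectorSearch(index_name="catalog.schema.docs_index")
retriever = vs.as_retriever(search_kwargs={"k": 4})  # retrieve the top-4 chunks per query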
Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa

Hi @WiliamRosa , 

Thanks for your response! 
I will still need prompt caching, as my use case requires it.
Other Databricks endpoints seem to work, like 'databricks-gpt-oss-120b' (using the same logic shared above). But I could not confirm that caching actually happens, since I cannot access token usage for these queries (a rough way to check is sketched below).
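
A rough way to inspect token usage on the LangChain response, assuming the endpoint populates the standard usage/response metadata fields:

resp = llm.invoke(msgs)
print(resp.usage_metadata)     # e.g. {"input_tokens": ..., "output_tokens": ..., "total_tokens": ...}
print(resp.response_metadata)  # provider-specific details, which may include cache-related counters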

Best regards!
