Re: How to implement prompt caching using Claude models?

DinoSaluzzi · ‎08-26-2025

Hi,

I am trying to use the prompt caching feature with Claude through the "databricks-claude-sonnet-4" Databricks endpoint (wrapped in a ChatDatabricks instance). Using LangChain, I set

SystemMessage(
    content=[
        {
            "text": cached_doc_prompt,
            "type": "text",
            "cache_control": {"type": "ephemeral"},
        }
    ]
),

for the part of the message I want to cache.

I get this error:

Response text: {"error_code":"BAD_REQUEST","message":"BAD_REQUEST: Databricks does not support prompt cache for the first-party Anthropic model."}

How can prompt caching be achieved in Databricks?

Thanks for your help!

WiliamRosa · ‎08-26-2025

Hi @DinoSaluzzi,

Anthropic’s prompt caching is not supported via Databricks endpoints, so the 400 error is expected.

To use real prompt caching, call Anthropic’s API directly.
To stay within Databricks, adopt alternatives such as a pseudo-cache, RAG, or context compression.
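
One way to read "pseudo-cache" here is an application-side cache that returns a stored answer when exactly the same messages are sent again. The following is only a minimal sketch under that assumption: `PromptCache` and `call_model` are hypothetical names (not Databricks or LangChain APIs), and unlike real prompt caching it does not help when prompts merely share a common prefix.

```python
import hashlib
import json
from typing import Callable, Dict, List


class PromptCache:
    """Exact-match response cache keyed by a hash of the message list."""

    def __init__(self, call_model: Callable[[List[dict]], str]):
        # call_model is a hypothetical function that sends messages to the LLM
        self._call_model = call_model
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(messages: List[dict]) -> str:
        # Serialize deterministically so identical messages hash identically
        payload = json.dumps(messages, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def invoke(self, messages: List[dict]) -> str:
        key = self._key(messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: call the model once and remember the answer
        self.misses += 1
        answer = self._call_model(messages)
        self._store[key] = answer
        return answer
```

In practice `call_model` could wrap something like `llm.invoke(...)`; a real deployment would also need an eviction policy and a decision about how long stored answers stay valid.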

Overview with RAG:
Instead of sending the entire context with every query:

- Ingest the content into Databricks Vector Search (or another vector store).
- For each user question, retrieve only the most relevant chunks (top-k) and attach them to the prompt.

Pros: dynamically reduces tokens, scales well with large documents.
Cons: requires an embeddings + retrieval pipeline, plus tuning of chunking/top-k.

High-level skeleton:

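from langchain_core.messages import HumanMessage, SystemMessage

# 1) Index documents (embeddings) → Vector Search / FAISS
# 2) For each question (assumes `retriever` and `llm` already exist,
#    e.g. a vector-store retriever and a ChatDatabricks instance):
query = "User question"
# retriever.invoke replaces the deprecated get_relevant_documents
contexts = retriever.invoke(query)  # top-k relevant documents
context_text = "\n\n".join(d.page_content for d in contexts)

msgs = [
    SystemMessage(content="Answer using only the provided context."),
    HumanMessage(content=f"Context:\n{context_text}\n\nQ: {query}"),
]
resp = llm.invoke(msgs)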

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa

DinoSaluzzi · ‎08-29-2025

Hi @WiliamRosa ,

Thanks for your response!
I still need prompt caching, as my use case requires it.
Other Databricks endpoints seem to work, such as 'databricks-gpt-oss-120b' (using the same logic as the one you shared in your message). However, I could not confirm that anything is actually cached, because I cannot access token usage for these queries.

Best regards!

XianCao_98793 · ‎09-08-2025

Hi @WiliamRosa,
Good day.

We have an Azure Databricks customer who asked the same question. Is there a roadmap to make this work on Serving Endpoints?
The customer mentioned that AWS Bedrock appears to support the prompt caching feature.

Pradeep54 · ‎11-16-2025

@DinoSaluzzi If you restructure your system and user prompts like the example below, prompt caching should start working as expected.

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a helpful Apache Spark expert. Always provide concise, technical answers.",
                "cache_control": {"type": "ephemeral"}
            }
        ]
    },
    {
        "role": "user",
        "content": "What are the top 3 benefits of using Apache Spark?"
    }
]

I can confirm that both cache_read_input_tokens and cache_creation_input_tokens update correctly, which indicates that caching is being applied.
Please note that prompt caching does not activate for small prompts; it is typically triggered only once the prompt size crosses a certain threshold.
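
To check this from the client side, the two counters mentioned above can be read from the response's usage data. A minimal sketch, assuming the response has been parsed into a dict with a `usage` object that carries those two field names; the exact response shape on a given endpoint is an assumption, and `cache_status` is a hypothetical helper:

```python
from typing import Any, Dict


def cache_status(response: Dict[str, Any]) -> str:
    """Classify a response as a cache hit, a cache write, or neither."""
    usage = response.get("usage") or {}
    # Missing or null counters are treated as zero
    created = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    if read > 0:
        return "hit"      # part of the prompt was served from the cache
    if created > 0:
        return "written"  # the prompt prefix was stored for later requests
    return "none"         # caching did not apply, e.g. the prompt was too short
```

With a long enough prompt, one would expect "written" on the first request and "hit" when the same prefix is sent again while the cache entry is still alive.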
