Hi @DinoSaluzzi,
Anthropic’s prompt caching is not supported through Databricks endpoints, so the 400 error is expected.
To use real prompt caching, call Anthropic’s API directly.
To stay within Databricks, adopt an alternative such as a pseudo-cache, RAG, or context compression.
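If the same prompts repeat verbatim, a pseudo-cache can be as simple as memoizing the endpoint’s responses keyed by a hash of the messages. A minimal sketch, assuming the databricks-langchain package is installed; the endpoint name is a placeholder, adapt it to your workspace:
import hashlib

from databricks_langchain import ChatDatabricks  # assumption: databricks-langchain is installed

# Hypothetical endpoint name; replace with your workspace's Claude serving endpoint.
llm = ChatDatabricks(endpoint="databricks-claude-3-7-sonnet")

_response_cache = {}

def cached_invoke(messages):
    # Return a cached response when the exact same messages were sent before.
    key = hashlib.sha256(repr(messages).encode("utf-8")).hexdigest()
    if key not in _response_cache:
        _response_cache[key] = llm.invoke(messages)
    return _response_cache[key]
This only saves cost when requests repeat exactly; for overlapping but non-identical prompts, RAG below is the better fit.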
Overview of the RAG approach:
Instead of sending the entire context with every query:
- Ingest the content into Databricks Vector Search (or another vector store).
- For each user question, retrieve only the most relevant chunks (top-k) and attach them to the prompt.
Pros: keeps prompt token counts down by sending only the relevant chunks, and scales well to large documents.
Cons: requires an embeddings + retrieval pipeline, plus tuning of chunking/top-k.
High-level skeleton:
# 1) Index documents (embeddings) → Vector Search / FAISS (see the indexing sketch after this block)
# 2) For each question, retrieve the top-k chunks and build the prompt:
from langchain_core.messages import SystemMessage, HumanMessage

query = "User question"
contexts = retriever.get_relevant_documents(query)  # top-k relevant chunks
context_text = "\n\n".join(d.page_content for d in contexts)

msgs = [
    SystemMessage(content="Answer using only the provided context."),
    HumanMessage(content=f"Context:\n{context_text}\n\nQ: {query}"),
]
resp = llm.invoke(msgs)  # llm: your Databricks-served chat model
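For step 1, a minimal indexing sketch, assuming FAISS as the local vector store and a Databricks-hosted embedding endpoint (the endpoint name and full_document_text are placeholders); Databricks Vector Search works the same way through its own retriever:
from databricks_langchain import DatabricksEmbeddings  # assumption: databricks-langchain is installed
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the source content into chunks before embedding.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.create_documents([full_document_text])  # full_document_text: your raw text

# Hypothetical embedding endpoint name; replace with one served in your workspace.
embeddings = DatabricksEmbeddings(endpoint="databricks-bge-large-en")
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # top-k chunks per query
Chunk size, overlap, and k are the main knobs to tune: smaller chunks and lower k reduce tokens further but risk dropping relevant context.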
Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa