Hi @DinoSaluzzi,
Anthropic’s prompt caching is not supported through Databricks endpoints, so the 400 error is expected.
To use real prompt caching, call Anthropic’s API directly.
To stay within Databricks, adopt an alternative such as a pseudo-cache, RAG, or context compression.
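If the same prompts repeat verbatim, a pseudo-cache can be as simple as memoizing the endpoint’s responses keyed by a hash of the messages. A minimal sketch, assuming the databricks-langchain package is installed; the endpoint name is a placeholder, adapt it to your workspace:
import hashlib

from databricks_langchain import ChatDatabricks  # assumption: databricks-langchain is installed

# Hypothetical endpoint name; replace with your workspace's Claude serving endpoint.
llm = ChatDatabricks(endpoint="databricks-claude-3-7-sonnet")

_response_cache = {}

def cached_invoke(messages):
    # Return a cached response when the exact same messages were sent before.
    key = hashlib.sha256(repr(messages).encode("utf-8")).hexdigest()
    if key not in _response_cache:
        _response_cache[key] = llm.invoke(messages)
    return _response_cache[key]
This only saves cost when requests repeat exactly; for overlapping but non-identical prompts, RAG below is the better fit.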
Overview of the RAG approach:
Instead of sending the entire context with every query:
- Ingest the content into Databricks Vector Search (or another vector store).
- For each user question, retrieve only the most relevant chunks (top-k) and attach them to the prompt.
Pros: keeps prompt token counts down by sending only the relevant chunks, and scales well to large documents.
Cons: requires an embeddings + retrieval pipeline, plus tuning of chunking/top-k.
High-level skeleton:
# 1) Index documents (embeddings) → Vector Search / FAISS (see the indexing sketch after this block)
# 2) For each question, retrieve the top-k chunks and build the prompt:
from langchain_core.messages import SystemMessage, HumanMessage

query = "User question"
contexts = retriever.get_relevant_documents(query)  # top-k relevant chunks
context_text = "\n\n".join(d.page_content for d in contexts)

msgs = [
    SystemMessage(content="Answer using only the provided context."),
    HumanMessage(content=f"Context:\n{context_text}\n\nQ: {query}"),
]
resp = llm.invoke(msgs)  # llm: your Databricks-served chat model
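For step 1, a minimal indexing sketch, assuming FAISS as the local vector store and a Databricks-hosted embedding endpoint (the endpoint name and full_document_text are placeholders); Databricks Vector Search works the same way through its own retriever:
from databricks_langchain import DatabricksEmbeddings  # assumption: databricks-langchain is installed
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the source content into chunks before embedding.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.create_documents([full_document_text])  # full_document_text: your raw text

# Hypothetical embedding endpoint name; replace with one served in your workspace.
embeddings = DatabricksEmbeddings(endpoint="databricks-bge-large-en")
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # top-k chunks per query
Chunk size, overlap, and k are the main knobs to tune: smaller chunks and lower k reduce tokens further but risk dropping relevant context.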
Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa