LLM-powered apps have two costs that compound fast: every request costs money, and users ask the same question in many ways. "How do I reset my password?", "I need to reset my password", and "Steps to reset password" are the same intent, but a traditional cache treats them as three separate lookups. And it’s not just the Foundation Model call that gets repeated in a RAG app, it's another round of vector search, in an agentic workflow it’s a fresh chain of tool calls, web searches, and multi-step reasoning. Semantic caching short-circuits all of it.
Semantic caching optimizes AI workflows by matching user queries based on their semantic intent rather than exact keyword overlap. An embedding model translates each prompt into a dense mathematical vector, a multi-dimensional coordinate array where conceptually related topics cluster tightly together. For instance, phrases like "reset my password" and "recover account credentials" map to nearly the same vector coordinates, while an unrelated query like "how is the weather?" sits far away in that space. When a new prompt arrives, the system embeds it and calculates the cosine distance to quickly locate the nearest previously cached vector. If this similarity score exceeds a strict threshold, the system immediately returns the cached response, otherwise it executes the full LLM generation and caches the new result.
The vector store here is pgvector, a PostgreSQL extension that adds a native vector type plus similarity-search operators (`<=>` cosine, `<->` L2, `<#>` inner product) and index types (IVFFlat, HNSW). With Databricks Lakebase (managed PostgreSQL with autoscaling), pgvector is built in no separate vector database to run.
In this post, we'll build a simple semantic cache using Databricks Lakebase and pgvector. The result: cache which improves responses, and reduces response latency all in about a single manageable notebook.
Semantic caching works best when:
It's less useful when every query is truly unique or when answers change minute-to-minute.
Lakebase is a great fit for semantic caching for several reasons:
It’s quite simple to get started with creating and searching vectors in PostgreSQL. Below is the sample and down below is the complete implementation.
-- Creat a table with vector datatype
CREATE TABLE IF NOT EXISTS cache (
question TEXT NOT NULL,
answer TEXT NOT NULL,
embedding vector(1024)
);
-- Create index on the vector type column
CREATE INDEX IF NOT EXISTS cache_embedding_idx
ON cache USING ivfflat (embedding vector_cosine_ops) WITH (lists = 10);
-- Insert data into the column holding vector
INSERT INTO cache (question, answer, embedding)
VALUES (%s, %s, %s::vector)
--Query the data and sort by similarity
SELECT 1 - (embedding <=> %s::vector) AS similarity
FROM cache
ORDER BY embedding <=> %s::vector
LIMIT 1
Follow the instructions here to create a Lakebase project and database: https://learn.microsoft.com/en-us/azure/databricks/oltp/projects/get-started
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.postgres import Project, ProjectSpec
# Initialize the Workspace client
w = WorkspaceClient()
# Create a Lakebase autoscaling instance (project)
operation = w.postgres.create_project(
project_id=<project-id>,
project=Project(
spec=ProjectSpec(
display_name="My Autoscaling Application",
pg_version="17" # Lakebase Autoscaling supports Postgres 16 and 17
)
)
)
# Wait for the instance creation to fully complete
result = operation.wait()
# install dependencies
%pip install psycopg2-binary pgvector openai --quiet
%pip install --upgrade databricks-sdk
dbutils.library.restartPython()
import os
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
WORKSPACE_URL = w.config.host
TOKEN = w.config.authenticate()["Authorization"].split(" ", 1)[1]
USER_EMAIL = w.current_user.me().user_name
LAKEBASE_INSTANCE_NAME = "chatbot_state"
LAKEBASE_DB = "semantic_cache"
# -- Model endpoints --
EMBEDDING_MODEL = "databricks-gte-large-en"
CHAT_MODEL = "databricks-meta-llama-3-3-70b-instruct"
# -- Cache settings --
SIMILARITY_THRESHOLD = 0.92 # Higher = stricter matching, fewer false hits
CACHE_TTL_SECONDS = 3600 # 1 hour
# Step 1: Setup Lakebase connection and initialize schema
import psycopg2
# Please consider connection pooling in your implementation
def get_database_client(instance_name, test_connection=False):
parent_branch = f"projects/{instance_name}/branches/production"
endpoints = list(w.postgres.list_endpoints(parent=parent_branch))
endpoint = endpoints[0]
cred = w.postgres.generate_database_credential(endpoint=endpoint.name)
# Connection parameters
conn = psycopg2.connect(
host=endpoint.status.hosts.host,
dbname=LAKEBASE_DB,
user=USER_EMAIL,
password=cred.token,
sslmode="require",
)
if test_connection:
with conn.cursor() as cur:
cur.execute("SELECT version()")
cur.fetchone()[0]
return conn
import psycopg2
import requests
# model_name is stored in the table to provide reference of the model used, if the model changes (breaking change), flush the cache.
def init_schema():
"""Create the cache table and vector index if they don't exist."""
conn = get_database_client(instance_name=LAKEBASE_INSTANCE_NAME)
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
CREATE TABLE IF NOT EXISTS cache (
id SERIAL PRIMARY KEY,
question TEXT NOT NULL,
answer TEXT NOT NULL,
embedding vector(1024),
model_name VARCHAR(100),
created_at TIMESTAMP DEFAULT NOW(),
ttl_seconds INT DEFAULT 3600
);
""")
cur.execute("""
CREATE INDEX IF NOT EXISTS cache_embedding_idx
ON cache USING ivfflat (embedding vector_cosine_ops) WITH (lists = 10);
""")
conn.commit()
cur.close()
conn.close()
print("Schema initialized.")
# Initialize the schema (one time)
init_schema()
# Step 2: Setup embedding and chat model clients
from openai import OpenAI
import numpy as np
client = OpenAI(
api_key=TOKEN,
base_url=f"{WORKSPACE_URL}/serving-endpoints"
)
def get_embedding(text: str) -> list[float]:
"""Generate an embedding for the given text."""
response = client.embeddings.create(
model=EMBEDDING_MODEL,
input=[text]
)
return response.data[0].embedding
def ask_llm(question: str) -> str:
"""Get a response from the LLM."""
response = client.chat.completions.create(
model=CHAT_MODEL,
messages=[
{"role": "system", "content": "You are a helpful assistant. Answer concisely."},
{"role": "user", "content": question}
],
max_tokens=256
)
return response.choices[0].message.content
# Step 3: Build the semantic cache
def cache_lookup(question: str, embedding: list[float]) -> str | None:
"""Search the cache for a semantically similar question."""
conn = get_database_client(instance_name=LAKEBASE_INSTANCE_NAME)
cur = conn.cursor()
cur.execute("""
SELECT question, answer,
1 - (embedding <=> %s::vector) AS similarity
FROM cache
ORDER BY embedding <=> %s::vector
LIMIT 1;
""", (embedding, embedding))
row = cur.fetchone()
cur.close()
conn.close()
if row and row[2] >= SIMILARITY_THRESHOLD:
print(f" Cache HIT (similarity: {row[2]:.4f}, matched: '{row[0][:60]}...')")
return row[1]
if row:
print(f" Cache MISS (best similarity: {row[2]:.4f}, below threshold {SIMILARITY_THRESHOLD})")
else:
print(" Cache MISS (empty cache)")
return None
def cache_store(question: str, answer: str, embedding: list[float]):
"""Store a question-answer pair in the cache."""
conn = get_database_client(instance_name=LAKEBASE_INSTANCE_NAME)
cur = conn.cursor()
cur.execute("""
INSERT INTO cache (question, answer, embedding, model_name, ttl_seconds)
VALUES (%s, %s, %s::vector, %s, %s);
""", (question, answer, embedding, CHAT_MODEL, CACHE_TTL_SECONDS))
conn.commit()
cur.close()
conn.close()
# Step 4: Putting it all together
import time
def ask(question: str) -> str:
"""Ask a question with semantic caching."""
print(f"\nQ: {question}")
start = time.time()
# Embed once — reuse for both lookup and store
embedding = get_embedding(question)
# Check cache
cached = cache_lookup(question, embedding)
if cached:
elapsed = (time.time() - start) * 1000
print(f" Time: {elapsed:.0f}ms (cached)")
return cached
# Cache miss — call the LLM application
answer = ask_llm(question)
cache_store(question, answer, embedding)
elapsed = (time.time() - start) * 1000
print(f" Time: {elapsed:.0f}ms (LLM call)")
return answer
# Step 5: Demo
# First call — goes to LLM
answer1 = ask("What is Apache Spark?")
print(f"\nA: {answer1}")
# Semantically similar — should hit cache
answer2 = ask("Can you explain what Apache Spark is?")
print(f"\nA: {answer2}")
# Another rephrasing — should also hit cache
answer3 = ask("Tell me about Apache Spark")
print(f"\nA: {answer3}")
# Different question — should miss cache
answer4 = ask("What is Delta Lake?")
print(f"\nA: {answer4}")
To validate the cache at scale, we ran 500 questions across 50 topic clusters. Each cluster has one canonical question (guaranteed cache miss — triggers an LLM call) and 9 rephrasings (expected cache hits). This simulates real-world usage where users ask the same things in different ways — “What is Apache Spark?”, “Can you explain what Apache Spark is?”, “Tell me about Apache Spark”, and so on.
Here’s what we saw:
|
Metric |
Value |
|
Total questions |
500 |
|
Topic clusters |
50 |
|
Cache hits |
~450/500 (90%) |
|
Cache misses |
~50/500 (10%) |
|
Correct hit rate |
~95%+ of rephrasings matched |
|
Avg cached response |
~200-400ms |
|
Avg LLM response |
~2,000-4,000ms |
|
Speedup |
5-10x faster for cache hits |
|
P95 cached latency |
<800ms |
A few things stood out:
Why is the cached path so much faster and cheaper? Because we skip more than just the LLM.
A full AI workflow chains multiple expensive steps:
The SIMILARITY_THRESHOLD is the most important parameter to tune:
Start at 0.92 and adjust based on your use case. Monitor false positives (cached answer doesn't actually match the question) and adjust downward if your hit rate is too low or upward if answer quality degrades.
In production, you'll want TTL-based expiration and periodic cleanup:
def evict_expired():
conn = get_connection()
cur = conn.cursor()
cur.execute("""
DELETE FROM cache
WHERE created_at < NOW() - INTERVAL '1 second' * ttl_seconds;
""")
deleted = cur.rowcount
conn.commit()
cur.close()
conn.close()
print(f"Evicted {deleted} expired entries.")
We built a semantic cache for LLM applications using just three Databricks components:
For many LLM applications, especially customer support bots, knowledge base assistants, and internal tools this pattern cuts Foundation Model costs and improves response times.
This pattern is intentionally minimal, one table, one similarity check, one threshold. To take it to production, here are natural extensions to explore:
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.