Databricks Community

pshyvanna · ‎05-08-2026

Introduction

LLM-powered apps have two costs that compound fast: every request costs money, and users ask the same question in many ways. "How do I reset my password?", "I need to reset my password", and "Steps to reset password" are the same intent, but a traditional cache treats them as three separate lookups. And it’s not just the Foundation Model call that gets repeated in a RAG app, it's another round of vector search, in an agentic workflow it’s a fresh chain of tool calls, web searches, and multi-step reasoning. Semantic caching short-circuits all of it.

Semantic caching optimizes AI workflows by matching user queries based on their semantic intent rather than exact keyword overlap. An embedding model translates each prompt into a dense mathematical vector, a multi-dimensional coordinate array where conceptually related topics cluster tightly together. For instance, phrases like "reset my password" and "recover account credentials" map to nearly the same vector coordinates, while an unrelated query like "how is the weather?" sits far away in that space. When a new prompt arrives, the system embeds it and calculates the cosine distance to quickly locate the nearest previously cached vector. If this similarity score exceeds a strict threshold, the system immediately returns the cached response, otherwise it executes the full LLM generation and caches the new result.

The vector store here is pgvector, a PostgreSQL extension that adds a native vector type plus similarity-search operators (`<=>` cosine, `<->` L2, `<#>` inner product) and index types (IVFFlat, HNSW). With Databricks Lakebase (managed PostgreSQL with autoscaling), pgvector is built in no separate vector database to run.

In this post, we'll build a simple semantic cache using Databricks Lakebase and pgvector. The result: cache which improves responses, and reduces response latency all in about a single manageable notebook.

When Does Semantic Caching Make Sense?

Semantic caching works best when:

Users ask repetitive questions differently — Customer support bots, FAQ assistants, internal knowledge bases
Freshness tolerance is acceptable — Cached answers from an hour ago are still valid
Cost or latency matters — You're paying per token, per retrieval call, or per agent execution, and need sub-second responses

It's less useful when every query is truly unique or when answers change minute-to-minute.

Architecture

A user asks a question
We generate a vector embedding of the question using Databricks Foundation Model APIs
We search Lakebase (pgvector) for the most similar previously-asked question
Cache HIT: If similarity is above our threshold, return the cached answer instantly
Cache MISS: Call the LLM, store the question + answer + embedding in the cache, and return the answer

Why Lakebase + pgvector?

Lakebase is a great fit for semantic caching for several reasons:

Built on open source : Managed PostgreSQL with the open-source pgvector extension. Massive ecosystem, no proprietary lock-in.
One database : Your cache lives alongside other relational data (user sessions, agent context, metadata). No extra service to manage.
Autoscaling : Scales to zero when idle, scales up under load. Pay only for what you use.
Familiar tooling : psql, psycopg2, SQLAlchemy — any Postgres client. No new SDK to learn.
pgvector performance : IVFFlat or HNSW indexes handle cache-sized datasets (thousands to low millions) with sub-millisecond lookups. Multiple distance operators are supported (<=> cosine, <-> L2, <#> inner product) — pick what matches your embedding normalization.
Unity Catalog : Lakebase databases can be registered for governance and access control.

pgvector in 4 statements

It’s quite simple to get started with creating and searching vectors in PostgreSQL. Below is the sample and down below is the complete implementation.

-- Creat a table with vector datatype 
CREATE TABLE IF NOT EXISTS cache (
           question TEXT NOT NULL,
           answer TEXT NOT NULL,
           embedding vector(1024)
       );

-- Create index on the vector type column 
CREATE INDEX IF NOT EXISTS cache_embedding_idx
       ON cache USING ivfflat (embedding vector_cosine_ops) WITH (lists = 10); 

-- Insert data into the column holding vector
INSERT INTO cache (question, answer, embedding)
       VALUES (%s, %s, %s::vector)

--Query the data and sort by similarity 
SELECT  1 - (embedding <=> %s::vector) AS similarity
       FROM cache
       ORDER BY embedding <=> %s::vector
       LIMIT 1

Step-by-Step Implementation

Prerequisites

A Databricks workspace with Lakebase enabled
Databricks Foundation Model API access (for embeddings and chat)

1. Create a Lakebase Project (one time)

Follow the instructions here to create a Lakebase project and database: https://learn.microsoft.com/en-us/azure/databricks/oltp/projects/get-started

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.postgres import Project, ProjectSpec

# Initialize the Workspace client
w = WorkspaceClient()

# Create a Lakebase autoscaling instance (project)
operation = w.postgres.create_project(
      project_id=<project-id>,
      project=Project(
          spec=ProjectSpec(
              display_name="My Autoscaling Application",
              pg_version="17" # Lakebase Autoscaling supports Postgres 16 and 17
          )
      )
)

# Wait for the instance creation to fully complete
result = operation.wait()

2. Fire up Databricks Serverless notebook, Install Dependencies and Start caching

# install dependencies
%pip install psycopg2-binary pgvector openai --quiet
%pip install --upgrade databricks-sdk
dbutils.library.restartPython()

import os
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

WORKSPACE_URL = w.config.host
TOKEN = w.config.authenticate()["Authorization"].split(" ", 1)[1]
USER_EMAIL = w.current_user.me().user_name

LAKEBASE_INSTANCE_NAME = "chatbot_state"
LAKEBASE_DB = "semantic_cache"

# -- Model endpoints --
EMBEDDING_MODEL = "databricks-gte-large-en"
CHAT_MODEL = "databricks-meta-llama-3-3-70b-instruct"

# -- Cache settings --
SIMILARITY_THRESHOLD = 0.92  # Higher = stricter matching, fewer false hits
CACHE_TTL_SECONDS = 3600     # 1 hour

# Step 1: Setup Lakebase connection and initialize schema
import psycopg2

# Please consider connection pooling in your implementation
def get_database_client(instance_name, test_connection=False):
  parent_branch = f"projects/{instance_name}/branches/production"
  endpoints = list(w.postgres.list_endpoints(parent=parent_branch))
  endpoint = endpoints[0]
  cred = w.postgres.generate_database_credential(endpoint=endpoint.name)
  # Connection parameters
  conn = psycopg2.connect(
      host=endpoint.status.hosts.host,
      dbname=LAKEBASE_DB,
      user=USER_EMAIL,
      password=cred.token,
      sslmode="require",
  )
  if test_connection:
      with conn.cursor() as cur:
          cur.execute("SELECT version()")
          cur.fetchone()[0]
  return conn

import psycopg2
import requests

# model_name is stored in the table to provide reference of the model used, if the model changes (breaking change), flush the cache.
def init_schema():
  """Create the cache table and vector index if they don't exist."""
  conn = get_database_client(instance_name=LAKEBASE_INSTANCE_NAME)
  cur = conn.cursor()
  cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
  cur.execute("""
      CREATE TABLE IF NOT EXISTS cache (
          id SERIAL PRIMARY KEY,
          question TEXT NOT NULL,
          answer TEXT NOT NULL,
          embedding vector(1024),
          model_name VARCHAR(100),
          created_at TIMESTAMP DEFAULT NOW(),
          ttl_seconds INT DEFAULT 3600
      );
  """)
  cur.execute("""
      CREATE INDEX IF NOT EXISTS cache_embedding_idx
      ON cache USING ivfflat (embedding vector_cosine_ops) WITH (lists = 10);
  """)
  conn.commit()
  cur.close()
  conn.close()
  print("Schema initialized.")

# Initialize the schema (one time)
init_schema()

# Step 2: Setup embedding and chat model clients
from openai import OpenAI
import numpy as np

client = OpenAI(
  api_key=TOKEN,
  base_url=f"{WORKSPACE_URL}/serving-endpoints"
)

def get_embedding(text: str) -> list[float]:
  """Generate an embedding for the given text."""
  response = client.embeddings.create(
      model=EMBEDDING_MODEL,
      input=[text]
  )
  return response.data[0].embedding

def ask_llm(question: str) -> str:
  """Get a response from the LLM."""
  response = client.chat.completions.create(
      model=CHAT_MODEL,
      messages=[
          {"role": "system", "content": "You are a helpful assistant. Answer concisely."},
          {"role": "user", "content": question}
      ],
      max_tokens=256
  )
  return response.choices[0].message.content

# Step 3: Build the semantic cache
def cache_lookup(question: str, embedding: list[float]) -> str | None:
  """Search the cache for a semantically similar question."""
  conn = get_database_client(instance_name=LAKEBASE_INSTANCE_NAME)
  cur = conn.cursor()
  cur.execute("""
      SELECT question, answer,
             1 - (embedding <=> %s::vector) AS similarity
      FROM cache
      ORDER BY embedding <=> %s::vector
      LIMIT 1;
  """, (embedding, embedding))

  row = cur.fetchone()
  cur.close()
  conn.close()

  if row and row[2] >= SIMILARITY_THRESHOLD:
      print(f"  Cache HIT (similarity: {row[2]:.4f}, matched: '{row[0][:60]}...')")
      return row[1]

  if row:
      print(f"  Cache MISS (best similarity: {row[2]:.4f}, below threshold {SIMILARITY_THRESHOLD})")
  else:
      print("  Cache MISS (empty cache)")
  return None


def cache_store(question: str, answer: str, embedding: list[float]):
  """Store a question-answer pair in the cache."""
  conn = get_database_client(instance_name=LAKEBASE_INSTANCE_NAME)
  cur = conn.cursor()
  cur.execute("""
      INSERT INTO cache (question, answer, embedding, model_name, ttl_seconds)
      VALUES (%s, %s, %s::vector, %s, %s);
  """, (question, answer, embedding, CHAT_MODEL, CACHE_TTL_SECONDS))
  conn.commit()
  cur.close()
  conn.close()

# Step 4: Putting it all together
import time

def ask(question: str) -> str:
  """Ask a question with semantic caching."""
  print(f"\nQ: {question}")
  start = time.time()

  # Embed once — reuse for both lookup and store
  embedding = get_embedding(question)

  # Check cache
  cached = cache_lookup(question, embedding)
  if cached:
      elapsed = (time.time() - start) * 1000
      print(f"  Time: {elapsed:.0f}ms (cached)")
      return cached

  # Cache miss — call the LLM application
  answer = ask_llm(question)
  cache_store(question, answer, embedding)

  elapsed = (time.time() - start) * 1000
  print(f"  Time: {elapsed:.0f}ms (LLM call)")
  return answer

# Step 5: Demo
# First call — goes to LLM
answer1 = ask("What is Apache Spark?")
print(f"\nA: {answer1}")

# Semantically similar — should hit cache
answer2 = ask("Can you explain what Apache Spark is?")
print(f"\nA: {answer2}")

# Another rephrasing — should also hit cache
answer3 = ask("Tell me about Apache Spark")
print(f"\nA: {answer3}")

# Different question — should miss cache
answer4 = ask("What is Delta Lake?")
print(f"\nA: {answer4}")

Results

To validate the cache at scale, we ran 500 questions across 50 topic clusters. Each cluster has one canonical question (guaranteed cache miss — triggers an LLM call) and 9 rephrasings (expected cache hits). This simulates real-world usage where users ask the same things in different ways — “What is Apache Spark?”, “Can you explain what Apache Spark is?”, “Tell me about Apache Spark”, and so on.

Here’s what we saw:

Metric	Value
Total questions	500
Topic clusters	50
Cache hits	~450/500 (90%)
Cache misses	~50/500 (10%)
Correct hit rate	~95%+ of rephrasings matched
Avg cached response	~200-400ms
Avg LLM response	~2,000-4,000ms
Speedup	5-10x faster for cache hits
P95 cached latency	<800ms

A few things stood out:

Hit rate scales with cluster size. The more rephrasings per topic, the higher the hit rate. With 9 rephrasings per canonical question, we consistently saw 80-90% of them matching.
False misses happen at the edges. Rephrasings that drift too far from the canonical question — like “What is Spark in the context of big data?” for “What is Apache Spark?” — sometimes fall below the similarity threshold. This is expected and tunable.
Latency is consistent. Cached responses stayed in the 200-400ms range regardless of how many entries were in the cache. The IVFFlat index keeps lookup times flat.
Cost savings are real. With many queries served from cache, that’s fewer LLM calls, retrieval pipeline executions, or agent runs — depending on your architecture.

What a cache hit skips

Why is the cached path so much faster and cheaper? Because we skip more than just the LLM.

A full AI workflow chains multiple expensive steps:

Embedding generation — A single call to a lightweight embedding model (e.g., GTE-Large). This is the only step the cached path shares with the uncached path.
Retrieval (RAG) — Vector search over potentially millions of document chunks, re-ranking, and context assembly. Skipped on cache hit.
LLM inference — The most expensive step. Large language models process thousands of tokens at rates that scale linearly with output length. Skipped on cache hit.
Agent orchestration — For agentic workflows, a single user query can trigger multiple LLM calls, tool invocations, web searches, and reasoning loops. All of this is skipped on cache hit.

Tuning the Similarity Threshold

The SIMILARITY_THRESHOLD is the most important parameter to tune:

0.95+: Very strict — only near-identical phrasings match. Low false positive rate but lower hit rate.
0.90-0.95: Good balance for most applications. Catches common rephrasings while maintaining answer quality.
0.85-0.90: Aggressive — more cache hits but higher risk of returning answers for slightly different questions.

Start at 0.92 and adjust based on your use case. Monitor false positives (cached answer doesn't actually match the question) and adjust downward if your hit rate is too low or upward if answer quality degrades.

Cache Management

In production, you'll want TTL-based expiration and periodic cleanup:

def evict_expired():
    conn = get_connection()
    cur = conn.cursor()
    cur.execute("""
        DELETE FROM cache
        WHERE created_at < NOW() - INTERVAL '1 second' * ttl_seconds;
    """)
    deleted = cur.rowcount
    conn.commit()
    cur.close()
    conn.close()
    print(f"Evicted {deleted} expired entries.")

Summary

We built a semantic cache for LLM applications using just three Databricks components:

Lakebase — Managed PostgreSQL that autoscales and scales to zero. No infrastructure to manage.
pgvector — Vector similarity search running directly in PostgreSQL.
Foundation Model APIs — Embeddings (GTE-Large, 1024 dimensions) and chat completions (Llama 3.3 70B).

For many LLM applications, especially customer support bots, knowledge base assistants, and internal tools this pattern cuts Foundation Model costs and improves response times.

Next Steps

This pattern is intentionally minimal, one table, one similarity check, one threshold. To take it to production, here are natural extensions to explore:

Scale the plumbing, not just the model — at production traffic, the bottleneck shifts from the LLM to Postgres connections; use connection pooling instead of opening a new connection per request, tune IVFFlat lists/probes as the cache grows past ~100K rows, and ANALYZE periodically so the planner picks the index.
Multi-tenant cache isolation — partition the cache by user_id, workspace_id, or tenant_id so users only hit their own entries
Adaptive similarity thresholds — tune SIMILARITY_THRESHOLD per topic cluster or query type instead of one global value; short factual questions tolerate higher thresholds than long, nuanced ones.
Cache warming — pre-populate the cache with canonical Q&A pairs from FAQs, support tickets, or top historical queries so the first user never pays the LLM cost.
Negative caching — store and return "I don't know" or refusal responses to avoid re-running pipelines for unanswerable questions.
Source-driven invalidation — when underlying documents in your RAG corpus change, invalidate cached answers that depended on them (track provenance via a source_doc_ids column).
Cache analytics & observability — log hit/miss rates, latency percentiles, and false-positive flags to a Delta table; build a Lakeview dashboard to monitor drift over time.
Integrate with Mosaic AI Agent Framework — wrap the cache as a tool in an agent, and use MLflow Tracing to measure end-to-end savings across multi-step agent runs.