Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
uday_satapathy
Databricks Employee

Introduction

In today’s AI-native world, applications no longer rely on exact keyword matches—they understand meaning. This shift is powered by embeddings: numerical representations of text that capture semantic similarity. For instance, a user searching for “moisturizer for dry skin” should also find products titled “hydrating face cream” or “intense moisture lotion”. While the words may differ, their meanings are similar—something traditional keyword search would miss. Embeddings help bridge this gap, enabling semantic search, RAG (Retrieval-Augmented Generation), and intelligent recommendations.
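Concretely, semantic similarity is measured between embedding vectors, most often with cosine similarity. A toy sketch in Python (the three-dimensional vectors are made up for illustration; a real model such as databricks-gte-large-en emits 1024 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings -- invented values, only the geometry matters here.
moisturizer = [0.9, 0.1, 0.2]    # "moisturizer for dry skin"
face_cream = [0.85, 0.15, 0.25]  # "hydrating face cream"
laptop = [0.1, 0.9, 0.05]        # "14-inch laptop"

print(cosine_similarity(moisturizer, face_cream))  # high: near 1.0
print(cosine_similarity(moisturizer, laptop))      # low
```

The two skincare phrases share almost no keywords, yet their vectors point in nearly the same direction, which is exactly the gap embeddings close.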

With pgvector, supported on Databricks Lakebase (a managed Postgres OLTP database), you can store, index, and query embeddings natively in SQL, bringing the power of semantic search directly to your data lakehouse.

 

Why pgvector on Databricks Lakebase?

  • Unified Platform: No separate vector database needed—everything runs in your existing Postgres instance. 
  • OLTP + AI: Store product metadata and embeddings together in a single system.
  • Lakehouse Integration: Unity Catalog tables produced by OLAP workloads can be mirrored into Lakebase using synced tables.
  • 100% Pure Postgres: Works with psycopg, JDBC, and your favorite BI tools.
  • Scalable Performance: IVFFlat and HNSW indexes enable fast vector search at scale.
  • Unity Catalog Integration: Register a Postgres database as a catalog in UC and query it through DBSQL using query federation.

This blog explores Postgres' native search features, specifically full-text search (keyword matching) and vector-based search (text embeddings), as an alternative to Databricks' managed Mosaic AI Vector Search. While Mosaic AI Vector Search currently offers closer integration with Unity Catalog, this guide focuses on leveraging Postgres' capabilities. The choice between Mosaic AI Vector Search and Postgres' native search depends on your specific use case.

 

Semantic Search in the Real World

E-commerce Product Search

  • Semantic Product Discovery: Find products even when customers use different terminology. For example, a user searches for “comfy winter boots for icy sidewalks,” and the search returns insulated, slip-resistant boots even if the product titles only say “snow boots with rubber sole.”
  • Recommendation Systems: Identify similar products based on embeddings. For example, a prospect views a “14-inch lightweight business laptop,” and the site recommends similar ultrabooks from other brands based on semantic similarity of specs and reviews, not just matching “14-inch” in the title.
  • Search Analytics: Understand what customers are looking for through query analysis. By clustering semantically similar queries like “work from home chair,” “ergonomic office seat,” and “chair for back pain,” the retailer sees a strong demand for ergonomic chairs and decides to expand that category.

Content Management

  • Document Search: Find relevant documents based on meaning, not just keywords. To fulfil a GDPR compliance requirement, an employee searches “how do we handle customer data deletion requests,” and the system surfaces the “Data Subject Rights Handling Procedure” PDF even though it never uses the exact phrase “deletion requests.”
  • Knowledge Base: Power intelligent FAQ systems and help desk solutions. For example, a business traveller asks, “Can I use this app when I’m offline on a flight?” and the help center returns the article “Using the mobile app without an internet connection,” even though the query never mentions “mobile app” or “offline mode” explicitly.
  • Content Recommendations: Suggest related articles or resources. After reading an article called “How to Save Money on Groceries,” the site recommends other posts like “Beginner’s Guide to Budgeting” and “Easy Cheap Dinner Recipes,” because they’re about similar ideas (saving money), even if they don’t share the same words.

Customer Support

  • Ticket Classification: Automatically categorize support tickets using semantic similarity. Tickets like “charged twice on my card,” “duplicate payment on statement,” and “double billing issue” are all routed to the “Billing > Duplicate Charges” queue automatically.
  • Answer Retrieval: Find relevant answers from knowledge bases. A support agent types “customer wants to change shipping address after ordering” into their console, and the system surfaces the internal SOP “Post-order address change policy” as the top suggested reply.
  • Sentiment Analysis: Combine with other AI models for comprehensive customer insights. Support messages like “app keeps crashing,” “can’t open the app,” and “screen freezes all the time” get grouped together, and sentiment analysis shows most of them are very angry or frustrated—signaling there’s a serious new bug the team needs to fix quickly.

The following steps walk you through Postgres' search functionality. You will learn to search the "All Beauty" product catalog using semantically rich phrases such as "sun protection for sensitive skin" or "hydration booster for face."

 

Architecture Overview

Our semantic search solution follows this workflow:

  1. Data Ingestion and Enrichment: Load product data from external sources (e.g., Amazon product metadata). Create searchable text and generate embeddings using Databricks AI functions.
  2. Synced Tables: Sync enriched data to Postgres using Lakebase synced tables
  3. Search Optimization: Create search-optimized tables with both full-text and vector indexes
  4. Query Execution: Perform hybrid (full-text and vector-based) searches combining semantic and keyword matching

[Diagram: end-to-end architecture of the semantic search workflow]

Step 1: Data Ingestion and Enrichment in Databricks

We'll start by loading Amazon product metadata for the category ‘All Beauty’ from the Amazon Reviews 2023 dataset and enriching it with searchable text and embeddings. 

Setting Up the Environment

CREATE CATALOG IF NOT EXISTS amazon_reviews_23;
CREATE SCHEMA IF NOT EXISTS amazon_reviews_23.datasets;
CREATE VOLUME IF NOT EXISTS amazon_reviews_23.datasets.raw;

Loading and Processing Data

from pyspark.sql.functions import from_json, schema_of_json, col, lit, expr, concat_ws
import requests

# Download Amazon Beauty product metadata
all_beauty_prod_meta_link = "https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/raw/meta_categories/meta_All_Beauty.jsonl.gz"
uc_volume_path = "/Volumes/amazon_reviews_23/datasets/raw/meta_All_Beauty.jsonl.gz"

response = requests.get(all_beauty_prod_meta_link, stream=True)
response.raise_for_status()

with open(uc_volume_path, "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:
            f.write(chunk)

Data Transformation and Enrichment

We clean the Amazon Reviews 2023 dataset by removing duplicate columns and invalid characters from column names. Then, we create a search_text column that consolidates all relevant textual information for product searches.

def flatten_and_sanitize(df, parent=""):
    """Clean up column names and flatten nested structs in the input dataframe"""
    fields = []
    for field in df.schema.fields:
        name = f"{parent}.{field.name}" if parent else field.name
        clean_name = name.replace(" ", "_").replace(".", "_").replace(",", "_") \
                          .replace(";", "_").replace("{", "_").replace("}", "_") \
                          .replace("(", "_").replace(")", "_").replace("\n", "_") \
                          .replace("\t", "_").replace("=", "_")
        if str(field.dataType).startswith("StructType"):
            fields += flatten_and_sanitize(df.select(col(field.name + ".*")), parent=clean_name)
        else:
            fields.append(col(name).alias(clean_name))
    return fields

def prepare_for_search(df):
    """Generate a search text column and embedding column from the input dataframe"""
    df = df.withColumn(
        "search_text",
        concat_ws(
            " ",
            "parent_asin",
            "title",
            "main_category",
            expr("array_join(categories, ' ')"),
            expr("array_join(description, ' ')"),
            expr("array_join(features, ' ')"),
            "store",
            "details_UPC",
            "details_Package_Dimensions"
        )
    )
    df = df.withColumn(
        "embedding",
        expr("ai_query('databricks-gte-large-en', search_text)")
    )
    return df

# Process the raw data
raw_df = spark.read.text(uc_volume_path)
sample_json = raw_df.limit(1).collect()[0][0]
json_schema = schema_of_json(sample_json)

parsed_df = raw_df.select(
    from_json("value", json_schema).alias("data")
).select("data.*")

flat_df = (parsed_df.select(flatten_and_sanitize(parsed_df))
           .transform(prepare_for_search)
           )

# Save the enriched data
flat_df.write.mode("overwrite").saveAsTable(
    "amazon_reviews_23.datasets.meta_all_beauty"
)

Key Features of the Data Processing

  • Schema Flattening: Automatically handles nested JSON structures and sanitizes column names
  • Search Text Generation: Combines multiple fields (title, categories, description, features) into a single searchable text field
  • Embedding Generation: Uses Databricks' built-in ai_query function with the databricks-gte-large-en model to generate 1024-dimensional embeddings
  • Data Quality: Handles missing values and ensures consistent data types

Step 2: Load the table into Lakebase

We now have the meta_all_beauty Unity Catalog table. For transactional applications, however, product searches against this dataset would typically run within an OLTP database. This is where Databricks Lakebase comes in.

As a prerequisite, you can follow these steps to create a Lakebase instance and register the database as a catalog in UC. Next, we will load the data in meta_all_beauty into our Postgres database using synced tables. Synced tables provide a managed data synchronization service from Unity Catalog tables to Postgres tables. You decide how, and how frequently, the replication happens by choosing one of the following modes:

  • Snapshot – Replaces the entire destination with a fresh full copy of the source table on each run; most efficient for large changes.
  • Triggered – Copies only incremental changes since the last run when manually or programmatically triggered; balances cost and lag.
  • Continuous – Continuously streams changes from the source to the destination in near real time; lowest lag but highest cost.

A synced table passes through several phases: creation, provisioning, and online.

Please note that the UI may change as the product evolves.

Creation of a Synced Table

[Screenshot: synced table creation]

 

Sync begins

[Screenshot: sync starting]

Rows being copied

[Screenshot: rows being copied]

 

Sync finished and the Postgres target table is online

[Screenshot: target table online]

Querying the Synced table from the Postgres SQL Editor:

[Screenshot: querying the synced table in the SQL editor]

The DDL for this Synced table in Postgres looks like this:

CREATE TABLE product_reviews_analysis_oltp.meta_all_beauty (
	average_rating float8 NULL,
	bought_together text NULL,
	categories jsonb NULL,
	description jsonb NULL,
	"details_Package_Dimensions" text NULL,
	"details_UPC" text NULL,
	features jsonb NULL,
	images jsonb NULL,
	main_category text NOT NULL,
	parent_asin text NOT NULL,
	price text NULL,
	rating_number int8 NULL,
	store text NULL,
	title text NULL,
	videos jsonb NULL,
	search_text text NULL,
	embedding jsonb NULL,
	CONSTRAINT meta_all_beauty_pkey PRIMARY KEY (parent_asin, main_category)
)
PARTITION BY RANGE (parent_asin, main_category);

Step 3: Postgres Setup for Search

Now we'll set up the Postgres side with pgvector support and create our search-optimized tables.

Enable pgvector Extension

CREATE EXTENSION IF NOT EXISTS vector;

Create the Search-Optimized Table

Product searches will happen against the meta_all_beauty_indexed table. This table is built using the data from the meta_all_beauty table (synced table) after adding two columns carrying indexing information for search. These columns are: 

  • embedding_pg::VECTOR(1024): Column with the pgvector vector datatype to hold text embeddings.
  • search_vector::TSVECTOR: Column with the text search datatype, optimized for full-text search.

CREATE TABLE product_reviews_analysis_oltp.meta_all_beauty_indexed (
  parent_asin TEXT NOT NULL,
  main_category TEXT NOT NULL,
  search_text TEXT NOT NULL,
  embedding_pg VECTOR(1024),
  search_vector TSVECTOR,
  PRIMARY KEY (parent_asin, main_category)
);

Populate the Table with Data

INSERT INTO product_reviews_analysis_oltp.meta_all_beauty_indexed (
  parent_asin, main_category, search_text, search_vector, embedding_pg
)
SELECT
  parent_asin,
  main_category,
  search_text,
  to_tsvector('english', search_text),
  string_to_array(trim(both '[]' from embedding::text), ',')::float8[]::vector
FROM product_reviews_analysis_oltp.meta_all_beauty;
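The bracket-trim-and-split cast in the SELECT above can be mirrored client-side for sanity-checking. This hedged Python sketch (the function name is illustrative) replicates what the SQL does to one row's embedding::text value:

```python
def jsonb_text_to_floats(embedding_text: str) -> list[float]:
    """Mirror of the SQL cast: trim(both '[]' from ...) then string_to_array(...)::float8[]."""
    stripped = embedding_text.strip().strip("[]")      # drop the surrounding brackets
    return [float(x) for x in stripped.split(",")]     # split on commas, cast to float

# What embedding::text looks like for one synced row (values are illustrative).
row = "[0.12, -0.03, 0.91]"
print(jsonb_text_to_floats(row))
```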

If the synced table is refreshed incrementally (Continuous or Triggered mode) and each refresh touches far fewer rows than the table already holds, you can refresh the search index table with a merge, which Postgres expresses as an INSERT + ON CONFLICT pattern. To support this, add a last_modified column to both the source table and the indexed table. For a given primary key, the existing row is updated only when the incoming row has a greater last_modified value:

INSERT INTO product_reviews_analysis_oltp.meta_all_beauty_indexed (
  parent_asin,
  main_category,
  search_text,
  search_vector,
  embedding_pg,
  last_modified
)
SELECT
  parent_asin,
  main_category,
  search_text,
  to_tsvector('english', search_text),
  string_to_array(trim(both '[]' from embedding::text), ',')::float8[]::vector,
  last_modified
FROM product_reviews_analysis_oltp.meta_all_beauty
WHERE embedding IS NOT NULL  -- optional safeguard
ON CONFLICT (parent_asin, main_category)
DO UPDATE SET
  search_text = EXCLUDED.search_text,
  search_vector = EXCLUDED.search_vector,
  embedding_pg = EXCLUDED.embedding_pg,
  last_modified = EXCLUDED.last_modified
WHERE EXCLUDED.last_modified > meta_all_beauty_indexed.last_modified;

Create Optimized Indexes

-- Full-text search index
CREATE INDEX idx_beauty_fts
  ON product_reviews_analysis_oltp.meta_all_beauty_indexed
  USING GIN (search_vector);

-- Vector similarity search index
CREATE INDEX idx_beauty_vector
  ON product_reviews_analysis_oltp.meta_all_beauty_indexed
  USING ivfflat (embedding_pg vector_cosine_ops)
  WITH (lists = 100);

-- Update table statistics
ANALYZE product_reviews_analysis_oltp.meta_all_beauty_indexed;

Understanding the Index Types

Before diving into the queries, let's understand what these indexes do:

GIN (Generalized Inverted Index)

Think of GIN as a book's index but for text search. When you search for "vitamin C serum," GIN quickly finds all documents containing those words without scanning every single record. It's like having a smart librarian who knows exactly which pages contain your search terms.

  • Best for: Full-text search, exact word matches
  • Speed: Very fast for keyword searches
  • Memory: Moderate memory usage
  • Example: Finding all products that contain "moisturizer" or "anti-aging"
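The inverted-index idea behind GIN can be sketched in a few lines. This toy example is illustrative only; Postgres additionally handles stemming, stop words, and positional information:

```python
from collections import defaultdict

docs = {
    1: "vitamin c serum for face",
    2: "hydrating face cream",
    3: "vitamin c moisturizer",
}

# Build the inverted index: each word maps to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

# A lookup touches only the posting lists for the query terms,
# never the full document set.
matches = index["vitamin"] & index["c"]
print(sorted(matches))  # ids of docs containing both terms
```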

IVFFlat (Inverted File with Flat Compression)

Imagine you have 1 million product descriptions, and you want to find the 10 most similar ones to a query. IVFFlat is like organizing products into 100 different buckets based on their similarity. When you search, it only looks in the most relevant buckets instead of checking all 1 million products.

[Diagram: IVFFlat bucketing illustration]

  • Best for: Vector similarity search, finding "similar" items
  • Speed: Fast for approximate similarity search
  • Memory: Low memory usage
  • Build time: Faster index build time than HNSW.
  • Rebuild time upon incremental updates: Generally, a full-rebuild is needed (slow).
  • Example: Finding products similar to "vitamin C serum" even if they don't contain those exact words
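The bucketing idea can be sketched as follows. Real IVFFlat learns its centroids with k-means during index build; the two hard-coded 2-D centroids here are purely illustrative:

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

centroids = [(0.0, 0.0), (10.0, 10.0)]  # two "lists" (hard-coded for illustration)
vectors = [(0.1, 0.2), (9.8, 10.1), (0.3, 0.1), (10.2, 9.9)]

# Build: assign each vector to the bucket of its nearest centroid.
buckets = {i: [] for i in range(len(centroids))}
for v in vectors:
    nearest = min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
    buckets[nearest].append(v)

# Query: probe only the bucket whose centroid is closest to the query vector,
# instead of scanning all vectors.
query = (0.2, 0.2)
probe = min(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
best = min(buckets[probe], key=lambda v: dist(query, v))
print(best)
```

Because only one bucket is scanned, the result is approximate: a true nearest neighbor sitting in an unprobed bucket would be missed, which is the accuracy/speed trade-off IVFFlat makes.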

HNSW (Hierarchical Navigable Small World)

HNSW is like having a smart GPS for similarity search. Instead of checking every possible route, it uses a network of connections to quickly navigate to the most similar items. It's more sophisticated than IVFFlat and often faster, but uses more memory. Mosaic AI Vector Search uses HNSW indexing.

[Diagram: HNSW layered graph illustration]

  • Best for: High-performance vector similarity search
  • Speed: Very fast, often faster than IVFFlat
  • Memory: Higher memory usage
  • Build time: Slower index build time than IVFFlat.
  • Rebuild time upon incremental updates: Fast insert into a graph. No global rebuild needed.
  • Example: Finding the most similar products in a large catalog with sub-second response times

When to Use Each Index

| Index Type | Use Case | Speed | Memory | Best For |
| --- | --- | --- | --- | --- |
| GIN | Text search | Very Fast | Low | “Find products with ‘vitamin C’” |
| IVFFlat | Vector similarity | Fast | Low | “Find products similar to this one” |
| HNSW | Vector similarity | Very Fast | High | “Find similar products in huge catalogs” |

For most applications, start with GIN for text search and IVFFlat for vector search. Upgrade to HNSW when you need maximum performance and have sufficient memory.

Step 4: Search Query Examples

Now let's explore different types of search queries you can perform.

Full-Text Search Only

Perfect for exact keyword matches and traditional search functionality:

WITH fts_search AS (
  SELECT parent_asin, main_category
  FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
  WHERE search_vector @@ plainto_tsquery('english', 'vitamin c moisturizer')
  LIMIT 10
)
SELECT
  f.parent_asin,
  f.main_category,
  m.title,
  m.price,
  m.rating_number,
  m.categories
FROM fts_search f
JOIN product_reviews_analysis_oltp.meta_all_beauty m
  ON f.parent_asin = m.parent_asin AND f.main_category = m.main_category;

[Screenshot: full-text search query results]

Ranked Full-Text Search

Enhance your search results with relevance scoring. The ts_rank_cd() function computes a score indicating how strongly the search_vector column matches the query 'vitamin c serum'. The @@ operator returns a boolean indicating whether a tsvector matches a tsquery.

WITH fts_ranked AS (
  SELECT
    parent_asin,
    main_category,
    ts_rank_cd(search_vector, plainto_tsquery('english', 'vitamin c serum')) AS rank
  FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
  WHERE search_vector @@ plainto_tsquery('english', 'vitamin c serum')
  ORDER BY rank DESC
  LIMIT 10
)
SELECT
  r.parent_asin,
  r.main_category,
  m.title,
  m.average_rating,
  m.price,
  r.rank
FROM fts_ranked r
JOIN product_reviews_analysis_oltp.meta_all_beauty m
  ON r.parent_asin = m.parent_asin AND r.main_category = m.main_category;

[Screenshot: ranked full-text search query results]

Vector Similarity Search

To find semantically similar products, you generate vector embeddings for your search text using the same embedding model that created the vector index. You then query the index to retrieve documents most similar to these generated embeddings. We achieve this by utilizing Databricks' ai_query() function with the databricks-gte-large-en foundation model to generate the embeddings.

WITH vector_search AS (
  SELECT parent_asin, main_category
  FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
  ORDER BY embedding_pg <-> '[0.3544921875,-0.272705078125,...]'::vector
  LIMIT 10
)
SELECT
  v.parent_asin,
  v.main_category,
  m.title,
  m.price,
  m.average_rating,
  m.store
FROM vector_search v
JOIN product_reviews_analysis_oltp.meta_all_beauty m
  ON v.parent_asin = m.parent_asin AND v.main_category = m.main_category;

[Screenshot: vector similarity search query results]
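In application code you would not paste the vector literal by hand. This hedged sketch formats a Python embedding as a pgvector text literal and shows how it could be bound as a query parameter (the helper name is made up, and the psycopg usage is illustrative: it assumes a live Lakebase connection and the table from the examples above):

```python
def to_pgvector_literal(embedding: list[float]) -> str:
    """Render a Python list of floats as a pgvector text literal like '[0.1,-0.2]'."""
    return "[" + ",".join(repr(x) for x in embedding) + "]"

# Truncated, illustrative embedding -- in practice this comes from ai_query()
# with the same databricks-gte-large-en model used to build the index.
query_embedding = [0.3544921875, -0.272705078125, 0.1]
literal = to_pgvector_literal(query_embedding)
print(literal)

# With psycopg (sketch only -- requires a live connection `cur`):
# cur.execute(
#     """
#     SELECT parent_asin, main_category
#     FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
#     ORDER BY embedding_pg <-> %s::vector
#     LIMIT 10
#     """,
#     (literal,),
# )
```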

Hybrid Search

Combine the power of both full-text and vector search for the best results:

WITH search_results AS (
  SELECT parent_asin, main_category
  FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
  WHERE search_vector @@ plainto_tsquery('english', 'vitamin c moisturizer')
  ORDER BY embedding_pg <-> '[0.3544921875,-0.272705078125,...]'::vector
  LIMIT 10
)
SELECT
  r.parent_asin,
  r.main_category,
  m.title,
  m.store,
  m.average_rating,
  m.rating_number,
  m.price,
  m.categories,
  m.features,
  m.description
FROM search_results r
JOIN product_reviews_analysis_oltp.meta_all_beauty m
  ON r.parent_asin = m.parent_asin AND r.main_category = m.main_category;

[Screenshot: hybrid search query results]
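The hybrid query above uses keyword matching as a filter and vector distance for ordering. Another common pattern, not shown in the SQL and offered here as one possible extension, is to run the two searches separately and merge their ranked results client-side with reciprocal rank fusion (RRF), where each result scores 1 / (k + rank) per list:

```python
def rrf_merge(fts_ids, vector_ids, k=60):
    """Merge two ranked id lists with reciprocal rank fusion (k=60 is conventional)."""
    scores = {}
    for ranked in (fts_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative product ids, as if returned by the two searches above.
fts_results = ["B01", "B02", "B03"]     # ranked by ts_rank_cd
vector_results = ["B03", "B01", "B04"]  # ranked by embedding distance
print(rrf_merge(fts_results, vector_results))
```

Items appearing near the top of both lists rise above items strong in only one, which tends to be more robust than trusting either signal alone.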

Performance Optimization Tips

Index Tuning

  • IVFFlat Index: Adjust the lists parameter based on your data size. For 1M vectors, use 100-1000 lists
  • HNSW Index: Switch to HNSW when you need lower query latency and can afford the higher memory use and longer build time
  • GIN Index: Optimize for your typical query patterns
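The pgvector project's rule of thumb is roughly rows / 1000 lists for tables up to about 1M rows and sqrt(rows) beyond that. A small helper encoding that heuristic (treat the output as a starting point to benchmark, not a guarantee):

```python
import math

def suggested_ivfflat_lists(row_count: int) -> int:
    """pgvector's suggested starting point for the IVFFlat `lists` parameter."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)  # rows / 1000 for small-to-medium tables
    return int(math.sqrt(row_count))      # sqrt(rows) for large tables

print(suggested_ivfflat_lists(100_000))    # 100
print(suggested_ivfflat_lists(4_000_000))  # 2000
```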

Query Optimization

  • Use LIMIT clauses to control result set size
  • Combine filters to reduce the search space before vector operations
  • Consider using approximate search for very large datasets

Monitoring and Maintenance

-- Check index usage
SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'meta_all_beauty_indexed';

-- Monitor query performance
EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
WHERE search_vector @@ plainto_tsquery('english', 'vitamin c');

Best Practices

Data Quality

  • Consistent Text Processing: Ensure your search text generation is consistent across all records
  • Embedding Quality: Use high-quality embedding models and validate results
  • Regular Updates: Keep embeddings current as your data changes

Security and Privacy

  • Access Control: Implement proper row-level security for sensitive data
  • Data Masking: Consider masking sensitive information in search results
  • Audit Logging: Track search queries for compliance and analytics

Scalability

  • Partitioning: Consider partitioning large tables by category or date
  • Caching: Implement query result caching for frequently accessed data
  • Load Balancing: Distribute query load across multiple read replicas
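As a sketch of the caching lever above, Python's functools.lru_cache can memoize repeated searches within a single process (run_search here is a stand-in for the real Postgres hybrid query; a shared store such as Redis would be the multi-process analogue):

```python
from functools import lru_cache

def run_search(query: str) -> list[str]:
    # Stand-in for the Postgres hybrid search; would return matching product ids.
    return [f"result-for-{query}"]

@lru_cache(maxsize=1024)
def cached_search(query: str) -> tuple[str, ...]:
    # Tuples (unlike lists) are hashable, so results can live in the cache.
    return tuple(run_search(query))

print(cached_search("vitamin c serum"))
print(cached_search.cache_info().hits)  # 0 -- first call was a miss
cached_search("vitamin c serum")
print(cached_search.cache_info().hits)  # 1 -- second call served from cache
```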

Conclusion

With pgvector on Databricks Lakebase, you can build powerful semantic search capabilities that understand meaning, not just keywords. This unified approach brings together:

  • Data Engineering: Seamless data processing with Spark and Delta Lake
  • AI Integration: Built-in embedding generation with Databricks AI functions
  • OLTP Performance: Fast, indexed queries with Postgres compatibility
  • Scalability: Enterprise-grade performance with proper indexing strategies

The combination of Databricks' data processing power and pgvector's search capabilities creates a compelling solution for modern AI applications. Whether you're building e-commerce search, content discovery, or intelligent document retrieval, this architecture provides the foundation for sophisticated semantic search at scale.

Start with the examples in this guide, adapt them to your specific use case, and watch as your applications become more intelligent and user-friendly through the power of semantic search.
