Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
uday_satapathy
Databricks Employee

Introduction

In today’s AI-native world, applications no longer rely on exact keyword matches—they understand meaning. This shift is powered by embeddings: numerical representations of text that capture semantic similarity. For instance, a user searching for “moisturizer for dry skin” should also find products titled “hydrating face cream” or “intense moisture lotion”. While the words may differ, their meanings are similar—something traditional keyword search would miss. Embeddings help bridge this gap, enabling semantic search, RAG (Retrieval-Augmented Generation), and intelligent recommendations.
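Concretely, semantic similarity is measured between embedding vectors, most often with cosine similarity. A toy sketch in Python (the three-dimensional vectors are made up for illustration; a real model such as databricks-gte-large-en emits 1024 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings -- invented values, only the geometry matters here.
moisturizer = [0.9, 0.1, 0.2]    # "moisturizer for dry skin"
face_cream = [0.85, 0.15, 0.25]  # "hydrating face cream"
laptop = [0.1, 0.9, 0.05]        # "14-inch laptop"

print(cosine_similarity(moisturizer, face_cream))  # high: near 1.0
print(cosine_similarity(moisturizer, laptop))      # low
```

The two skincare phrases share almost no keywords, yet their vectors point in nearly the same direction, which is exactly the gap embeddings close.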

With pgvector, supported on Databricks Lakebase (a managed Postgres OLTP database), you can store, index, and query embeddings natively in SQL, bringing the power of semantic search directly to your data lakehouse.

 

Why pgvector on Databricks Lakebase?

  • Unified Platform: No separate vector database needed—everything runs in your existing Postgres instance. 
  • OLTP + AI: Store product metadata and embeddings together in a single system.
  • Lakehouse Integration: Unity Catalog tables produced by OLAP workloads can be mirrored into Lakebase using synced tables.
  • 100% Pure Postgres: Works with psycopg, JDBC, and your favorite BI tools.
  • Scalable Performance: IVFFlat and HNSW indexes enable fast vector search at scale.
  • Unity Catalog Integration: Register a Postgres database as a catalog in UC and query it through DBSQL using query federation.

This blog explores Postgres' native search features, specifically full-text search (keyword matching) and vector-based search (text embeddings), as an alternative to Databricks' managed Mosaic AI Vector Search. While Mosaic AI Vector Search currently offers closer integration with Unity Catalog, this guide focuses on leveraging Postgres' capabilities. The choice between Mosaic AI Vector Search and Postgres' native search depends on your specific use case.

 

Semantic Search in the Real World

E-commerce Product Search

  • Semantic Product Discovery: Find products even when customers use different terminology. For example, a user searches for “comfy winter boots for icy sidewalks,” and the search returns insulated, slip-resistant boots even if the product titles only say “snow boots with rubber sole.”
  • Recommendation Systems: Identify similar products based on embeddings. For example, a prospect views a “14-inch lightweight business laptop,” and the site recommends similar ultrabooks from other brands based on semantic similarity of specs and reviews, not just matching “14-inch” in the title.
  • Search Analytics: Understand what customers are looking for through query analysis. By clustering semantically similar queries like “work from home chair,” “ergonomic office seat,” and “chair for back pain,” the retailer sees a strong demand for ergonomic chairs and decides to expand that category.

Content Management

  • Document Search: Find relevant documents based on meaning, not just keywords. To fulfil a GDPR compliance requirement, an employee searches “how do we handle customer data deletion requests,” and the system surfaces the “Data Subject Rights Handling Procedure” PDF even though it never uses the exact phrase “deletion requests.”
  • Knowledge Base: Power intelligent FAQ systems and help desk solutions. For example, a business traveller asks, “Can I use this app when I’m offline on a flight?” and the help center returns the article “Using the mobile app without an internet connection,” even though the query never mentions “mobile app” or “offline mode” explicitly.
  • Content Recommendations: Suggest related articles or resources. After reading an article called “How to Save Money on Groceries,” the site recommends other posts like “Beginner’s Guide to Budgeting” and “Easy Cheap Dinner Recipes,” because they’re about similar ideas (saving money), even if they don’t share the same words.

Customer Support

  • Ticket Classification: Automatically categorize support tickets using semantic similarity. Tickets like “charged twice on my card,” “duplicate payment on statement,” and “double billing issue” are all routed to the “Billing > Duplicate Charges” queue automatically.
  • Answer Retrieval: Find relevant answers from knowledge bases. A support agent types “customer wants to change shipping address after ordering” into their console, and the system surfaces the internal SOP “Post-order address change policy” as the top suggested reply.
  • Sentiment Analysis: Combine with other AI models for comprehensive customer insights. Support messages like “app keeps crashing,” “can’t open the app,” and “screen freezes all the time” get grouped together, and sentiment analysis shows most of them are very angry or frustrated—signaling there’s a serious new bug the team needs to fix quickly.

The following steps walk you through Postgres' search functionality. You will learn to search the "All Beauty" product catalog using semantically rich phrases such as "sun protection for sensitive skin" or "hydration booster for face."

 

Architecture Overview

Our semantic search solution follows this workflow:

  1. Data Ingestion and Enrichment: Load product data from external sources (e.g., Amazon product metadata). Create searchable text and generate embeddings using Databricks AI functions.
  2. Synced Tables: Sync enriched data to Postgres using Lakebase synced tables
  3. Search Optimization: Create search-optimized tables with both full-text and vector indexes
  4. Query Execution: Perform hybrid (full-text and vector-based) searches combining semantic and keyword matching

[Diagram: end-to-end architecture of the semantic search workflow]

Step 1: Data Ingestion and Enrichment in Databricks

We'll start by loading Amazon product metadata for the category ‘All Beauty’ from the Amazon Reviews 2023 dataset and enriching it with searchable text and embeddings. 

Setting Up the Environment

CREATE CATALOG IF NOT EXISTS amazon_reviews_23;
CREATE SCHEMA IF NOT EXISTS amazon_reviews_23.datasets;
CREATE VOLUME IF NOT EXISTS amazon_reviews_23.datasets.raw;

Loading and Processing Data

from pyspark.sql.functions import from_json, schema_of_json, col, lit, expr, concat_ws
import requests

# Download Amazon Beauty product metadata
all_beauty_prod_meta_link = "https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/raw/meta_categories/meta_All_Beauty.jsonl.gz"
uc_volume_path = "/Volumes/amazon_reviews_23/datasets/raw/meta_All_Beauty.jsonl.gz"

response = requests.get(all_beauty_prod_meta_link, stream=True)
response.raise_for_status()

with open(uc_volume_path, "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:
            f.write(chunk)

Data Transformation and Enrichment

We clean the Amazon Reviews 2023 dataset by removing duplicate columns and invalid characters from column names. Then, we create a search_text column that consolidates all relevant textual information for product searches.

def flatten_and_sanitize(df, parent=""):
    """Clean up column names and flatten nested structs in the input dataframe"""
    fields = []
    for field in df.schema.fields:
        name = f"{parent}.{field.name}" if parent else field.name
        clean_name = name.replace(" ", "_").replace(".", "_").replace(",", "_") \
                          .replace(";", "_").replace("{", "_").replace("}", "_") \
                          .replace("(", "_").replace(")", "_").replace("\n", "_") \
                          .replace("\t", "_").replace("=", "_")
        if str(field.dataType).startswith("StructType"):
            fields += flatten_and_sanitize(df.select(col(field.name + ".*")), parent=clean_name)
        else:
            fields.append(col(name).alias(clean_name))
    return fields

def prepare_for_search(df):
    """Generate a search text column and embedding column from the input dataframe"""
    df = df.withColumn(
        "search_text",
        concat_ws(
            " ",
            "parent_asin",
            "title",
            "main_category",
            expr("array_join(categories, ' ')"),
            expr("array_join(description, ' ')"),
            expr("array_join(features, ' ')"),
            "store",
            "details_UPC",
            "details_Package_Dimensions"
        )
    )
    df = df.withColumn(
        "embedding",
        expr("ai_query('databricks-gte-large-en', search_text)")
    )
    return df

# Process the raw data
raw_df = spark.read.text(uc_volume_path)
sample_json = raw_df.limit(1).collect()[0][0]
json_schema = schema_of_json(sample_json)

parsed_df = raw_df.select(
    from_json("value", json_schema).alias("data")
).select("data.*")

flat_df = (parsed_df.select(flatten_and_sanitize(parsed_df))
           .transform(prepare_for_search)
           )

# Save the enriched data
flat_df.write.mode("overwrite").saveAsTable(
    "amazon_reviews_23.datasets.meta_all_beauty"
)

Key Features of the Data Processing

  • Schema Flattening: Automatically handles nested JSON structures and sanitizes column names
  • Search Text Generation: Combines multiple fields (title, categories, description, features) into a single searchable text field
  • Embedding Generation: Uses Databricks' built-in ai_query function with the databricks-gte-large-en model to generate 1024-dimensional embeddings
  • Data Quality: Handles missing values and ensures consistent data types

Step 2: Load the table into Lakebase

We now have the meta_all_beauty Unity Catalog table. For transactional applications, however, product searches against this dataset would typically run within an OLTP database. This is where Databricks Lakebase comes in.

As a prerequisite, you can follow these steps to create a Lakebase instance and register the database as a catalog in UC. Next, we will load the data in meta_all_beauty into our Postgres database using synced tables. Synced tables provide a managed data synchronization service from Unity Catalog tables to Postgres tables. You decide how, and how frequently, the replication happens by choosing one of the following modes:

  • Snapshot – Replaces the entire destination with a fresh full copy of the source table on each run; most efficient for large changes.
  • Triggered – Copies only incremental changes since the last run when manually or programmatically triggered; balances cost and lag.
  • Continuous – Continuously streams changes from the source to the destination in near real time; lowest lag but highest cost.

A synced table passes through several phases: creation, provisioning, and online.

Please note that the UI may change as the product evolves.

Creation of a Synced Table

[Screenshot: synced table creation]

 

Sync begins

[Screenshot: sync starting]

Rows being copied

[Screenshot: rows being copied]

 

Sync finished and the Postgres target table is online

[Screenshot: target table online]

Querying the Synced table from the Postgres SQL Editor:

[Screenshot: querying the synced table in the SQL editor]

The DDL for this Synced table in Postgres looks like this:

CREATE TABLE product_reviews_analysis_oltp.meta_all_beauty (
	average_rating float8 NULL,
	bought_together text NULL,
	categories jsonb NULL,
	description jsonb NULL,
	"details_Package_Dimensions" text NULL,
	"details_UPC" text NULL,
	features jsonb NULL,
	images jsonb NULL,
	main_category text NOT NULL,
	parent_asin text NOT NULL,
	price text NULL,
	rating_number int8 NULL,
	store text NULL,
	title text NULL,
	videos jsonb NULL,
	search_text text NULL,
	embedding jsonb NULL,
	CONSTRAINT meta_all_beauty_pkey PRIMARY KEY (parent_asin, main_category)
)
PARTITION BY RANGE (parent_asin, main_category);

Step 3: Postgres Setup for Search

Now we'll set up the Postgres side with pgvector support and create our search-optimized tables.

Enable pgvector Extension

CREATE EXTENSION IF NOT EXISTS vector;

Create the Search-Optimized Table

Product searches will happen against the meta_all_beauty_indexed table. This table is built using the data from the meta_all_beauty table (synced table) after adding two columns carrying indexing information for search. These columns are: 

  • embedding_pg::VECTOR(1024): Column with the pgvector vector datatype to hold text embeddings.
  • search_vector::TSVECTOR: Column with the text search datatype, optimized for full-text search.

CREATE TABLE product_reviews_analysis_oltp.meta_all_beauty_indexed (
  parent_asin TEXT NOT NULL,
  main_category TEXT NOT NULL,
  search_text TEXT NOT NULL,
  embedding_pg VECTOR(1024),
  search_vector TSVECTOR,
  PRIMARY KEY (parent_asin, main_category)
);

Populate the Table with Data

INSERT INTO product_reviews_analysis_oltp.meta_all_beauty_indexed (
  parent_asin, main_category, search_text, search_vector, embedding_pg
)
SELECT
  parent_asin,
  main_category,
  search_text,
  to_tsvector('english', search_text),
  string_to_array(trim(both '[]' from embedding::text), ',')::float8[]::vector
FROM product_reviews_analysis_oltp.meta_all_beauty;
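The bracket-trim-and-split cast in the SELECT above can be mirrored client-side for sanity-checking. This hedged Python sketch (the function name is illustrative) replicates what the SQL does to one row's embedding::text value:

```python
def jsonb_text_to_floats(embedding_text: str) -> list[float]:
    """Mirror of the SQL cast: trim(both '[]' from ...) then string_to_array(...)::float8[]."""
    stripped = embedding_text.strip().strip("[]")      # drop the surrounding brackets
    return [float(x) for x in stripped.split(",")]     # split on commas, cast to float

# What embedding::text looks like for one synced row (values are illustrative).
row = "[0.12, -0.03, 0.91]"
print(jsonb_text_to_floats(row))
```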

If the synced table is refreshed incrementally (Continuous or Triggered mode) and each refresh touches far fewer rows than the table already holds, you can refresh the search index table with a merge, which Postgres expresses as an INSERT + ON CONFLICT pattern. To support this, add a last_modified column to both the source table and the indexed table. For a given primary key, the existing row is updated only when the incoming row has a greater last_modified value:

INSERT INTO product_reviews_analysis_oltp.meta_all_beauty_indexed (
  parent_asin,
  main_category,
  search_text,
  search_vector,
  embedding_pg,
  last_modified
)
SELECT
  parent_asin,
  main_category,
  search_text,
  to_tsvector('english', search_text),
  string_to_array(trim(both '[]' from embedding::text), ',')::float8[]::vector,
  last_modified
FROM product_reviews_analysis_oltp.meta_all_beauty
WHERE embedding IS NOT NULL  -- optional safeguard
ON CONFLICT (parent_asin, main_category)
DO UPDATE SET
  search_text = EXCLUDED.search_text,
  search_vector = EXCLUDED.search_vector,
  embedding_pg = EXCLUDED.embedding_pg,
  last_modified = EXCLUDED.last_modified
WHERE EXCLUDED.last_modified > meta_all_beauty_indexed.last_modified;

Create Optimized Indexes

-- Full-text search index
CREATE INDEX idx_beauty_fts
  ON product_reviews_analysis_oltp.meta_all_beauty_indexed
  USING GIN (search_vector);

-- Vector similarity search index
CREATE INDEX idx_beauty_vector
  ON product_reviews_analysis_oltp.meta_all_beauty_indexed
  USING ivfflat (embedding_pg vector_cosine_ops)
  WITH (lists = 100);

-- Update table statistics
ANALYZE product_reviews_analysis_oltp.meta_all_beauty_indexed;

Understanding the Index Types

Before diving into the queries, let's understand what these indexes do:

GIN (Generalized Inverted Index)

Think of GIN as a book's index but for text search. When you search for "vitamin C serum," GIN quickly finds all documents containing those words without scanning every single record. It's like having a smart librarian who knows exactly which pages contain your search terms.

  • Best for: Full-text search, exact word matches
  • Speed: Very fast for keyword searches
  • Memory: Moderate memory usage
  • Example: Finding all products that contain "moisturizer" or "anti-aging"
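The inverted-index idea behind GIN can be sketched in a few lines. This toy example is illustrative only; Postgres additionally handles stemming, stop words, and positional information:

```python
from collections import defaultdict

docs = {
    1: "vitamin c serum for face",
    2: "hydrating face cream",
    3: "vitamin c moisturizer",
}

# Build the inverted index: each word maps to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

# A lookup touches only the posting lists for the query terms,
# never the full document set.
matches = index["vitamin"] & index["c"]
print(sorted(matches))  # ids of docs containing both terms
```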

IVFFlat (Inverted File with Flat Compression)

Imagine you have 1 million product descriptions, and you want to find the 10 most similar ones to a query. IVFFlat is like organizing products into 100 different buckets based on their similarity. When you search, it only looks in the most relevant buckets instead of checking all 1 million products.

[Diagram: IVFFlat bucketing illustration]

  • Best for: Vector similarity search, finding "similar" items
  • Speed: Fast for approximate similarity search
  • Memory: Low memory usage
  • Build time: Faster index build time than HNSW.
  • Rebuild time upon incremental updates: Generally, a full-rebuild is needed (slow).
  • Example: Finding products similar to "vitamin C serum" even if they don't contain those exact words
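The bucketing idea can be sketched as follows. Real IVFFlat learns its centroids with k-means during index build; the two hard-coded 2-D centroids here are purely illustrative:

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

centroids = [(0.0, 0.0), (10.0, 10.0)]  # two "lists" (hard-coded for illustration)
vectors = [(0.1, 0.2), (9.8, 10.1), (0.3, 0.1), (10.2, 9.9)]

# Build: assign each vector to the bucket of its nearest centroid.
buckets = {i: [] for i in range(len(centroids))}
for v in vectors:
    nearest = min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
    buckets[nearest].append(v)

# Query: probe only the bucket whose centroid is closest to the query vector,
# instead of scanning all vectors.
query = (0.2, 0.2)
probe = min(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
best = min(buckets[probe], key=lambda v: dist(query, v))
print(best)
```

Because only one bucket is scanned, the result is approximate: a true nearest neighbor sitting in an unprobed bucket would be missed, which is the accuracy/speed trade-off IVFFlat makes.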

HNSW (Hierarchical Navigable Small World)

HNSW is like having a smart GPS for similarity search. Instead of checking every possible route, it uses a network of connections to quickly navigate to the most similar items. It's more sophisticated than IVFFlat and often faster, but uses more memory. Mosaic AI Vector Search uses HNSW indexing.

[Diagram: HNSW layered graph illustration]

  • Best for: High-performance vector similarity search
  • Speed: Very fast, often faster than IVFFlat
  • Memory: Higher memory usage
  • Build time: Slower index build time than IVFFlat.
  • Rebuild time upon incremental updates: Fast insert into a graph. No global rebuild needed.
  • Example: Finding the most similar products in a large catalog with sub-second response times

When to Use Each Index

| Index Type | Use Case | Speed | Memory | Best For |
| --- | --- | --- | --- | --- |
| GIN | Text search | Very Fast | Low | “Find products with ‘vitamin C’” |
| IVFFlat | Vector similarity | Fast | Low | “Find products similar to this one” |
| HNSW | Vector similarity | Very Fast | High | “Find similar products in huge catalogs” |

For most applications, start with GIN for text search and IVFFlat for vector search. Upgrade to HNSW when you need maximum performance and have sufficient memory.

Step 4: Search Query Examples

Now let's explore different types of search queries you can perform.

Full-Text Search Only

Perfect for exact keyword matches and traditional search functionality:

WITH fts_search AS (
  SELECT parent_asin, main_category
  FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
  WHERE search_vector @@ plainto_tsquery('english', 'vitamin c moisturizer')
  LIMIT 10
)
SELECT
  f.parent_asin,
  f.main_category,
  m.title,
  m.price,
  m.rating_number,
  m.categories
FROM fts_search f
JOIN product_reviews_analysis_oltp.meta_all_beauty m
  ON f.parent_asin = m.parent_asin AND f.main_category = m.main_category;

[Screenshot: full-text search query results]

Ranked Full-Text Search

Enhance your search results with relevance scoring. The ts_rank_cd() function computes a score indicating how strongly the search_vector column matches the query 'vitamin c serum'. The @@ operator returns a boolean indicating whether a tsvector matches a tsquery.

WITH fts_ranked AS (
  SELECT
    parent_asin,
    main_category,
    ts_rank_cd(search_vector, plainto_tsquery('english', 'vitamin c serum')) AS rank
  FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
  WHERE search_vector @@ plainto_tsquery('english', 'vitamin c serum')
  ORDER BY rank DESC
  LIMIT 10
)
SELECT
  r.parent_asin,
  r.main_category,
  m.title,
  m.average_rating,
  m.price,
  r.rank
FROM fts_ranked r
JOIN product_reviews_analysis_oltp.meta_all_beauty m
  ON r.parent_asin = m.parent_asin AND r.main_category = m.main_category;

[Screenshot: ranked full-text search query results]

Vector Similarity Search

To find semantically similar products, you generate vector embeddings for your search text using the same embedding model that created the vector index. You then query the index to retrieve documents most similar to these generated embeddings. We achieve this by utilizing Databricks' ai_query() function with the databricks-gte-large-en foundation model to generate the embeddings.

WITH vector_search AS (
  SELECT parent_asin, main_category
  FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
  ORDER BY embedding_pg <-> '[0.3544921875,-0.272705078125,...]'::vector
  LIMIT 10
)
SELECT
  v.parent_asin,
  v.main_category,
  m.title,
  m.price,
  m.average_rating,
  m.store
FROM vector_search v
JOIN product_reviews_analysis_oltp.meta_all_beauty m
  ON v.parent_asin = m.parent_asin AND v.main_category = m.main_category;

[Screenshot: vector similarity search query results]
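In application code you would not paste the vector literal by hand. This hedged sketch formats a Python embedding as a pgvector text literal and shows how it could be bound as a query parameter (the helper name is made up, and the psycopg usage is illustrative: it assumes a live Lakebase connection and the table from the examples above):

```python
def to_pgvector_literal(embedding: list[float]) -> str:
    """Render a Python list of floats as a pgvector text literal like '[0.1,-0.2]'."""
    return "[" + ",".join(repr(x) for x in embedding) + "]"

# Truncated, illustrative embedding -- in practice this comes from ai_query()
# with the same databricks-gte-large-en model used to build the index.
query_embedding = [0.3544921875, -0.272705078125, 0.1]
literal = to_pgvector_literal(query_embedding)
print(literal)

# With psycopg (sketch only -- requires a live connection `cur`):
# cur.execute(
#     """
#     SELECT parent_asin, main_category
#     FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
#     ORDER BY embedding_pg <-> %s::vector
#     LIMIT 10
#     """,
#     (literal,),
# )
```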

Hybrid Search

Combine the power of both full-text and vector search for the best results:

WITH search_results AS (
  SELECT parent_asin, main_category
  FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
  WHERE search_vector @@ plainto_tsquery('english', 'vitamin c moisturizer')
  ORDER BY embedding_pg <-> '[0.3544921875,-0.272705078125,...]'::vector
  LIMIT 10
)
SELECT
  r.parent_asin,
  r.main_category,
  m.title,
  m.store,
  m.average_rating,
  m.rating_number,
  m.price,
  m.categories,
  m.features,
  m.description
FROM search_results r
JOIN product_reviews_analysis_oltp.meta_all_beauty m
  ON r.parent_asin = m.parent_asin AND r.main_category = m.main_category;

[Screenshot: hybrid search query results]
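The hybrid query above uses keyword matching as a filter and vector distance for ordering. Another common pattern, not shown in the SQL and offered here as one possible extension, is to run the two searches separately and merge their ranked results client-side with reciprocal rank fusion (RRF), where each result scores 1 / (k + rank) per list:

```python
def rrf_merge(fts_ids, vector_ids, k=60):
    """Merge two ranked id lists with reciprocal rank fusion (k=60 is conventional)."""
    scores = {}
    for ranked in (fts_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative product ids, as if returned by the two searches above.
fts_results = ["B01", "B02", "B03"]     # ranked by ts_rank_cd
vector_results = ["B03", "B01", "B04"]  # ranked by embedding distance
print(rrf_merge(fts_results, vector_results))
```

Items appearing near the top of both lists rise above items strong in only one, which tends to be more robust than trusting either signal alone.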

Performance Optimization Tips

Index Tuning

  • IVFFlat Index: Adjust the lists parameter based on your data size. For 1M vectors, use 100-1000 lists
  • HNSW Index: Switch to HNSW when you need lower query latency and can afford the higher memory use and longer build time
  • GIN Index: Optimize for your typical query patterns
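The pgvector project's rule of thumb is roughly rows / 1000 lists for tables up to about 1M rows and sqrt(rows) beyond that. A small helper encoding that heuristic (treat the output as a starting point to benchmark, not a guarantee):

```python
import math

def suggested_ivfflat_lists(row_count: int) -> int:
    """pgvector's suggested starting point for the IVFFlat `lists` parameter."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)  # rows / 1000 for small-to-medium tables
    return int(math.sqrt(row_count))      # sqrt(rows) for large tables

print(suggested_ivfflat_lists(100_000))    # 100
print(suggested_ivfflat_lists(4_000_000))  # 2000
```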

Query Optimization

  • Use LIMIT clauses to control result set size
  • Combine filters to reduce the search space before vector operations
  • Consider using approximate search for very large datasets

Monitoring and Maintenance

-- Check index usage
SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'meta_all_beauty_indexed';

-- Monitor query performance
EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
WHERE search_vector @@ plainto_tsquery('english', 'vitamin c');

Best Practices

Data Quality

  • Consistent Text Processing: Ensure your search text generation is consistent across all records
  • Embedding Quality: Use high-quality embedding models and validate results
  • Regular Updates: Keep embeddings current as your data changes

Security and Privacy

  • Access Control: Implement proper row-level security for sensitive data
  • Data Masking: Consider masking sensitive information in search results
  • Audit Logging: Track search queries for compliance and analytics

Scalability

  • Partitioning: Consider partitioning large tables by category or date
  • Caching: Implement query result caching for frequently accessed data
  • Load Balancing: Distribute query load across multiple read replicas
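As a sketch of the caching lever above, Python's functools.lru_cache can memoize repeated searches within a single process (run_search here is a stand-in for the real Postgres hybrid query; a shared store such as Redis would be the multi-process analogue):

```python
from functools import lru_cache

def run_search(query: str) -> list[str]:
    # Stand-in for the Postgres hybrid search; would return matching product ids.
    return [f"result-for-{query}"]

@lru_cache(maxsize=1024)
def cached_search(query: str) -> tuple[str, ...]:
    # Tuples (unlike lists) are hashable, so results can live in the cache.
    return tuple(run_search(query))

print(cached_search("vitamin c serum"))
print(cached_search.cache_info().hits)  # 0 -- first call was a miss
cached_search("vitamin c serum")
print(cached_search.cache_info().hits)  # 1 -- second call served from cache
```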

Conclusion

With pgvector on Databricks Lakebase, you can build powerful semantic search capabilities that understand meaning, not just keywords. This unified approach brings together:

  • Data Engineering: Seamless data processing with Spark and Delta Lake
  • AI Integration: Built-in embedding generation with Databricks AI functions
  • OLTP Performance: Fast, indexed queries with Postgres compatibility
  • Scalability: Enterprise-grade performance with proper indexing strategies

The combination of Databricks' data processing power and pgvector's search capabilities creates a compelling solution for modern AI applications. Whether you're building e-commerce search, content discovery, or intelligent document retrieval, this architecture provides the foundation for sophisticated semantic search at scale.

Start with the examples in this guide, adapt them to your specific use case, and watch as your applications become more intelligent and user-friendly through the power of semantic search.
