In today’s AI-native world, applications no longer rely on exact keyword matches—they understand meaning. This shift is powered by embeddings: numerical representations of text that capture semantic similarity. For instance, a user searching for “moisturizer for dry skin” should also find products titled “hydrating face cream” or “intense moisture lotion”. While the words may differ, their meanings are similar—something traditional keyword search would miss. Embeddings help bridge this gap, enabling semantic search, RAG (Retrieval-Augmented Generation), and intelligent recommendations.
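To make "semantic similarity" concrete, embeddings are compared with a distance measure, most commonly cosine similarity. The sketch below uses made-up 4-dimensional vectors purely for illustration (real models such as gte-large-en emit 1024-dimensional embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example
moisturizer_dry_skin = [0.90, 0.80, 0.10, 0.00]
hydrating_face_cream = [0.85, 0.75, 0.20, 0.05]
laptop_stand         = [0.05, 0.10, 0.90, 0.80]

print(cosine_similarity(moisturizer_dry_skin, hydrating_face_cream))  # close to 1
print(cosine_similarity(moisturizer_dry_skin, laptop_stand))          # much smaller
```

The two skincare phrases share almost no words, yet their (hypothetical) embeddings point in nearly the same direction, which is exactly the signal keyword search misses.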
With pgvector, supported on Databricks Lakebase (a Postgres OLTP database), you can store, index, and query embeddings natively in SQL, bringing the power of semantic search directly to your data lakehouse.
This blog explores Postgres' native search features, specifically full-text search (keyword matching) and vector-based search (text embeddings), as an alternative to Databricks' managed Mosaic AI Vector Search. While Mosaic AI Vector Search currently offers closer integration with Unity Catalog, this guide focuses on leveraging Postgres' capabilities. The choice between Mosaic AI Vector Search and Postgres' native search depends on your specific use case.
The following steps will familiarize you with Postgres' search functionality. You will learn to search the "All Beauty" product category using semantically rich phrases such as "sun protection for sensitive skin" or "hydration booster for face."
Our semantic search solution follows this workflow: enrich the product data with searchable text and embeddings in Databricks, sync it to a Lakebase Postgres instance, build full-text and vector indexes there, and run keyword, semantic, and hybrid queries in SQL.
We'll start by loading Amazon product metadata for the category ‘All Beauty’ from the Amazon Reviews 2023 dataset and enriching it with searchable text and embeddings.
CREATE CATALOG IF NOT EXISTS amazon_reviews_23;
CREATE SCHEMA IF NOT EXISTS amazon_reviews_23.datasets;
CREATE VOLUME IF NOT EXISTS amazon_reviews_23.datasets.raw;
from pyspark.sql.functions import from_json, schema_of_json, col, expr, concat_ws
import requests

# Download Amazon Beauty product metadata into the UC volume
all_beauty_prod_meta_link = "https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/raw/meta_categories/meta_All_Beauty.jsonl.gz"
uc_volume_path = "/Volumes/amazon_reviews_23/datasets/raw/meta_All_Beauty.jsonl.gz"

response = requests.get(all_beauty_prod_meta_link, stream=True)
response.raise_for_status()

with open(uc_volume_path, "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:
            f.write(chunk)
We clean the Amazon Reviews 2023 dataset by removing duplicate columns and invalid characters from column names. Then, we create a search_text column that consolidates all relevant textual information for product searches.
import re
from pyspark.sql.types import StructType

def flatten_and_sanitize(df, parent=""):
    """Clean up column names and flatten nested structs in the input dataframe."""
    # Resolve the sub-schema that `parent` points at, so recursion keeps using
    # the original (dotted) column paths against the top-level dataframe while
    # sanitized names are used only for the output aliases
    schema = df.schema
    for part in [p for p in parent.split(".") if p]:
        schema = schema[part].dataType

    fields = []
    for field in schema.fields:
        name = f"{parent}.{field.name}" if parent else field.name
        clean_name = re.sub(r"[ .,;{}()\n\t=]", "_", name)
        if isinstance(field.dataType, StructType):
            fields += flatten_and_sanitize(df, parent=name)
        else:
            fields.append(col(name).alias(clean_name))
    return fields
def prepare_for_search(df):
    """Generate a search text column and embedding column from the input dataframe."""
    df = df.withColumn(
        "search_text",
        concat_ws(
            " ",
            "parent_asin",
            "title",
            "main_category",
            expr("array_join(categories, ' ')"),
            expr("array_join(description, ' ')"),
            expr("array_join(features, ' ')"),
            "store",
            "details_UPC",
            "details_Package_Dimensions",
        ),
    )
    df = df.withColumn(
        "embedding",
        expr("ai_query('databricks-gte-large-en', search_text)"),
    )
    return df
# Process the raw data
raw_df = spark.read.text(uc_volume_path)
sample_json = raw_df.limit(1).collect()[0][0]
json_schema = schema_of_json(sample_json)

parsed_df = raw_df.select(
    from_json("value", json_schema).alias("data")
).select("data.*")

flat_df = (
    parsed_df.select(flatten_and_sanitize(parsed_df))
    .transform(prepare_for_search)
)

# Save the enriched data
flat_df.write.mode("overwrite").saveAsTable(
    "amazon_reviews_23.datasets.meta_all_beauty"
)
We now have the meta_all_beauty table in Unity Catalog. For transactional applications, however, product searches against this dataset would typically run inside an OLTP database. This is where Databricks Lakebase comes in.
As a prerequisite, you can follow these steps to create a Lakebase instance and register the database as a catalog in UC. Over the next few steps we will load the data in meta_all_beauty into our Postgres database using synced tables. Synced tables are a managed data synchronization service from Unity Catalog tables to Postgres tables. You decide how, and how frequently, replication happens by choosing one of the following modes:
- Snapshot: each pipeline run takes a full snapshot of the source table.
- Triggered: after the initial snapshot, incremental changes are applied on a schedule or on demand.
- Continuous: after the initial snapshot, incremental changes are streamed with low latency.
Please note that the UI may change as the product evolves.
CREATE TABLE product_reviews_analysis_oltp.meta_all_beauty (
average_rating float8 NULL,
bought_together text NULL,
categories jsonb NULL,
description jsonb NULL,
"details_Package_Dimensions" text NULL,
"details_UPC" text NULL,
features jsonb NULL,
images jsonb NULL,
main_category text NOT NULL,
parent_asin text NOT NULL,
price text NULL,
rating_number int8 NULL,
store text NULL,
title text NULL,
videos jsonb NULL,
search_text text NULL,
embedding jsonb NULL,
CONSTRAINT meta_all_beauty_pkey PRIMARY KEY (parent_asin, main_category)
)
PARTITION BY RANGE (parent_asin, main_category);
Now we'll set up the Postgres side with pgvector support and create our search-optimized tables.
CREATE EXTENSION IF NOT EXISTS vector;
Product searches will run against the meta_all_beauty_indexed table. This table is built from the data in meta_all_beauty (the synced table), plus two columns carrying indexing information for search:
- search_vector (TSVECTOR): the tokenized, normalized form of search_text used by Postgres full-text search.
- embedding_pg (VECTOR(1024)): the text embedding cast into pgvector's native vector type for similarity search.
CREATE TABLE product_reviews_analysis_oltp.meta_all_beauty_indexed (
parent_asin TEXT NOT NULL,
main_category TEXT NOT NULL,
search_text TEXT NOT NULL,
embedding_pg VECTOR(1024),
search_vector TSVECTOR,
PRIMARY KEY (parent_asin, main_category)
);
INSERT INTO product_reviews_analysis_oltp.meta_all_beauty_indexed (
parent_asin, main_category, search_text, search_vector, embedding_pg
)
SELECT
parent_asin,
main_category,
search_text,
to_tsvector('english', search_text),
string_to_array(trim(both '[]' from embedding::text), ',')::float8[]::vector
FROM product_reviews_analysis_oltp.meta_all_beauty;
If the synced table is refreshed incrementally (Triggered or Continuous mode) and each refresh inserts or updates far fewer rows than already exist in the table, you can refresh the search index table with a merge operation, which Postgres expresses as an INSERT + ON CONFLICT pattern. Add a last_modified column to the source table; for a given primary key, if the incoming row's last_modified is greater than that of the existing row, perform an update instead of an insert:
INSERT INTO product_reviews_analysis_oltp.meta_all_beauty_indexed (
parent_asin,
main_category,
search_text,
search_vector,
embedding_pg,
last_modified
)
SELECT
parent_asin,
main_category,
search_text,
to_tsvector('english', search_text),
string_to_array(trim(both '[]' from embedding::text), ',')::float8[]::vector,
last_modified
FROM product_reviews_analysis_oltp.meta_all_beauty
WHERE embedding IS NOT NULL -- optional safeguard
ON CONFLICT (parent_asin, main_category)
DO UPDATE SET
search_text = EXCLUDED.search_text,
search_vector = EXCLUDED.search_vector,
embedding_pg = EXCLUDED.embedding_pg,
last_modified = EXCLUDED.last_modified
WHERE EXCLUDED.last_modified > product_reviews_analysis_oltp.meta_all_beauty_indexed.last_modified;
-- Full-text search index
CREATE INDEX idx_beauty_fts
ON product_reviews_analysis_oltp.meta_all_beauty_indexed
USING GIN (search_vector);
-- Vector similarity search index
CREATE INDEX idx_beauty_vector
ON product_reviews_analysis_oltp.meta_all_beauty_indexed
USING ivfflat (embedding_pg vector_cosine_ops)
WITH (lists = 100);
-- Update table statistics
ANALYZE product_reviews_analysis_oltp.meta_all_beauty_indexed;
Before diving into the queries, let's understand what these indexes do:
Think of GIN as a book's index but for text search. When you search for "vitamin C serum," GIN quickly finds all documents containing those words without scanning every single record. It's like having a smart librarian who knows exactly which pages contain your search terms.
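Conceptually, a GIN index over a tsvector is an inverted index: a map from each token to the set of rows containing it. A minimal Python sketch of the idea (the toy documents are invented; this is not how Postgres implements GIN internally):

```python
from collections import defaultdict

# Toy product texts, invented for illustration
docs = {
    1: "vitamin c serum for face",
    2: "hydrating face cream with vitamin e",
    3: "matte lipstick long lasting",
}

# Build the inverted index: token -> set of doc ids containing it
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        inverted[token].add(doc_id)

def search(*terms):
    """AND-match: docs containing every term, found via set intersection
    over the posting lists instead of scanning every document."""
    return set.intersection(*(inverted[t] for t in terms))

print(search("vitamin", "face"))  # -> {1, 2}
```

A query only touches the posting lists for its terms, which is why GIN lookups stay fast as the table grows.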
Imagine you have 1 million product descriptions, and you want to find the 10 most similar ones to a query. IVFFlat is like organizing products into 100 different buckets based on their similarity. When you search, it only looks in the most relevant buckets instead of checking all 1 million products.
HNSW is like having a smart GPS for similarity search. Instead of checking every possible route, it uses a network of connections to quickly navigate to the most similar items. It's more sophisticated than IVFFlat and often faster, but uses more memory. Mosaic AI Vector Search uses HNSW indexing.
| Index Type | Use Case | Speed | Memory | Best For |
|------------|----------|-------|--------|----------|
| GIN | Text search | Very Fast | Low | "Find products with 'vitamin C'" |
| IVFFlat | Vector similarity | Fast | Low | "Find products similar to this one" |
| HNSW | Vector similarity | Very Fast | High | "Find similar products in huge catalogs" |
For most applications, start with GIN for text search and IVFFlat for vector search. Upgrade to HNSW when you need maximum performance and have sufficient memory.
Now let's explore different types of search queries you can perform.
Perfect for exact keyword matches and traditional search functionality:
WITH fts_search AS (
SELECT parent_asin, main_category
FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
WHERE search_vector @@ plainto_tsquery('english', 'vitamin c moisturizer')
LIMIT 10
)
SELECT
f.parent_asin,
f.main_category,
m.title,
m.price,
m.rating_number,
m.categories
FROM fts_search f
JOIN product_reviews_analysis_oltp.meta_all_beauty m
ON f.parent_asin = m.parent_asin AND f.main_category = m.main_category;
Enhance your search results with relevance scoring. The ts_rank_cd() function computes a score indicating how closely the search_vector column matches the query 'vitamin c serum', while the @@ operator returns a boolean indicating whether a tsvector matches a tsquery.
WITH fts_ranked AS (
SELECT
parent_asin,
main_category,
ts_rank_cd(search_vector, plainto_tsquery('english', 'vitamin c serum')) AS rank
FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
WHERE search_vector @@ plainto_tsquery('english', 'vitamin c serum')
ORDER BY rank DESC
LIMIT 10
)
SELECT
r.parent_asin,
r.main_category,
m.title,
m.average_rating,
m.price,
r.rank
FROM fts_ranked r
JOIN product_reviews_analysis_oltp.meta_all_beauty m
ON r.parent_asin = m.parent_asin AND r.main_category = m.main_category;
To find semantically similar products, generate a vector embedding for your search text using the same embedding model that populated the index, then retrieve the rows whose stored embeddings are nearest to it. We generate the query embedding with Databricks' ai_query() function and the databricks-gte-large-en foundation model; the vector literal in the query below is truncated for readability.
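From application code, you would embed the user's query and pass the result into the <-> query as a pgvector literal. A sketch of the client-side plumbing is below; the embedding call and database connection are omitted, and to_pgvector_literal plus the hardcoded query vector are illustrative assumptions, not a fixed API:

```python
def to_pgvector_literal(embedding):
    """Format a Python list of floats as a pgvector text literal, e.g. '[0.1,-0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

# Suppose this came back from the same model that populated the index
# (databricks-gte-large-en); truncated to 3 dimensions for the example
query_embedding = [0.3544921875, -0.272705078125, 0.1]

# Parameterized query text for your Postgres driver of choice
sql = """
SELECT parent_asin, main_category
FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
ORDER BY embedding_pg <-> %s::vector
LIMIT 10
"""
params = (to_pgvector_literal(query_embedding),)
print(params[0])  # -> [0.3544921875,-0.272705078125,0.1]
```

Passing the vector as a bound parameter, rather than splicing it into the SQL string, keeps the query plan cacheable and avoids injection issues.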
WITH vector_search AS (
SELECT parent_asin, main_category
FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
ORDER BY embedding_pg <-> '[0.3544921875,-0.272705078125,...]'::vector
LIMIT 10
)
SELECT
v.parent_asin,
v.main_category,
m.title,
m.price,
m.average_rating,
m.store
FROM vector_search v
JOIN product_reviews_analysis_oltp.meta_all_beauty m
ON v.parent_asin = m.parent_asin AND v.main_category = m.main_category;
Combine the power of both full-text and vector search for the best results:
WITH search_results AS (
SELECT parent_asin, main_category
FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
WHERE search_vector @@ plainto_tsquery('english', 'vitamin c moisturizer')
ORDER BY embedding_pg <-> '[0.3544921875,-0.272705078125,...]'::vector
LIMIT 10
)
SELECT
r.parent_asin,
r.main_category,
m.title,
m.store,
m.average_rating,
m.rating_number,
m.price,
m.categories,
m.features,
m.description
FROM search_results r
JOIN product_reviews_analysis_oltp.meta_all_beauty m
ON r.parent_asin = m.parent_asin AND r.main_category = m.main_category;
-- Check index usage (pg_stat_user_indexes exposes relname/indexrelname)
SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'meta_all_beauty_indexed';
-- Monitor query performance
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM product_reviews_analysis_oltp.meta_all_beauty_indexed
WHERE search_vector @@ plainto_tsquery('english', 'vitamin c');
With pgvector on Databricks Lakebase, you can build powerful semantic search capabilities that understand meaning, not just keywords. This unified approach brings together:
- Full-text search (tsvector + GIN) for fast keyword matching
- Vector similarity search (pgvector) for semantic retrieval
- Your transactional product data, all in one Postgres database
The combination of Databricks' data processing power and pgvector's search capabilities creates a compelling solution for modern AI applications. Whether you're building e-commerce search, content discovery, or intelligent document retrieval, this architecture provides the foundation for sophisticated semantic search at scale.
Start with the examples in this guide, adapt them to your specific use case, and watch as your applications become more intelligent and user-friendly through the power of semantic search.