How to perform combined search on structured and unstructured data in databrick using RAG or other

narenderkumar57 — Thu, 28 Nov 2024 12:32:20 GMT

I created a RAG application in databricks which performs the following steps:

1. Extract text from PDF files

2. Prepare embeddings on extracted text and create vector search index

3. Create a LLM model and served the model which can answer question based on pdf data.

I followed the below URL mainly to develop the same:

https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#

Now I have some structured data in the delta tables as well. I want to perform the combined search on the pdf extracted data and the structured tables data.

I know that we can create vector search index on the structured table and use it for searching. But the problem with this approach is that I need to create a separate vector search for each table and also in vector search, the embeddings are created only on 1 column and that is used for searching via embeddings. How I can use all the columns from multiple structed tables and performed combined search via LLMs in databricks?

I also tried using Genie in databricks. I can perform search only on structured data and not on the text extracted form pdf files, so I could not use it.

I am open to use any other options also which can work in databricks.

Re: How to perform combined search on structured and unstructured data in databrick using RAG or oth

mark_ott — Fri, 07 Nov 2025 16:54:15 GMT

You can achieve combined retrieval across both PDF-extracted unstructured data and multiple columns from structured Delta tables in Databricks, but there are important considerations and available patterns to optimize this workflow for your RAG application.

Key Considerations for Combined Search

Vector Search Index Limitation: Currently, Databricks Vector Search only supports embedding generation and search over a single column per index. Creating a vector index for each table (or column) is possible, but not ideal for seamless multi-column or multi-table querying.
Metadata & Hybrid Search: You can enrich each indexed record with metadata from other columns and use hybrid search (keyword + vector similarity). However, true semantic search across all columns via embeddings is restricted to one source column per index.
Unifying Data: The best practice is to preprocess and combine the relevant text from multiple columns (and potentially tables) into a single column for embedding purposes. This allows the embeddings to represent the joint semantics of all the fields you wish to search over.
Document Chunking: For unstructured files, chunk and enrich as needed, storing relevant metadata (e.g., PDF source, page number). Then, you can concatenate structured table fields into the same chunk format if appropriate.

Recommended Approach in Databricks

Step 1: Preprocess Data
- For structured tables, create a view or preprocessed table where values from key columns are concatenated into a single text column (e.g., using SQL’s CONCAT or Python string joining). Add relevant metadata columns for filtering or context.
- For PDF-extracted text, keep your existing chunks, but consider matching their schema and metadata fields to what’s used for structured data.
Step 2: Unified Delta Table
- Merge these representations into a unified Delta table or ensure both reside in similarly structured tables (i.e., both have a main text column and rich metadata attributes).
Step 3: Embedding and Vectorizing
- Generate embeddings for the unified "search_text" column—whether it originated from structured tables or PDF chunks.
- Create a single vector index on this merged table if possible, or use separate indices with unified querying logic in your LLM chain/RAG pipeline.
Step 4: Retrieval and RAG Workflow
- When conducting retrieval for LLM augmentation, use hybrid similarity/keyword search on the main text column and metadata filtering for context (e.g., table type, document source, etc.).
- The LLM can then answer based on returned chunks regardless of origin (structured table or PDF).

Additional Databricks-Native Solutions

Unstructured.io Pipeline: Consider using the Unstructured + Databricks integration to ingest, unify, and enrich both document and table content for downstream vector search and GenAI workloads.
LangChain Integration: Tools like LangChain can help orchestrate retrieval from multiple sources, merging results and facilitating the augmented LLM responses.
Genie Limitation: Genie only operates with structured data, so it's not suitable for unified search that includes PDF-extracted text.

Example Workflow

Use Unstructured or custom ETL to transform all relevant fields (structured and unstructured) into a shared text column, with clear metadata.
Store this in a unified Delta table.
Create vector search index on this column.
At query time, filter or annotate responses by source (table name, document, page, etc.), allowing the LLM to provide joined or source-annotated answers.

Final Remarks

Databricks currently does not support seamless multi-column embedding per vector index. Combining columns before embedding is the recommended way. For more sophisticated requirements, hybrid designs (metadata filtering, post-search merging in your LLM pipeline) or using orchestration tools like LangChain or Unstructured.io will offer best results.