<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How to perform combined search on structured and unstructured data in databrick using RAG or other in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/how-to-perform-combined-search-on-structured-and-unstructured/m-p/100334#M648</link>
    <description>&lt;P&gt;I created a RAG application in databricks which performs the following steps:&lt;/P&gt;&lt;P&gt;1. Extract text from PDF files&lt;/P&gt;&lt;P&gt;2. Prepare embeddings on extracted text and create vector search index&lt;/P&gt;&lt;P&gt;3. Create a LLM model and served the model which can answer question based on pdf data.&lt;/P&gt;&lt;P&gt;I followed the below URL mainly to develop the same:&lt;/P&gt;&lt;P&gt;&lt;A href="https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#" target="_blank"&gt;https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Now I have some structured data in the delta tables as well. I want to perform the combined search on the pdf extracted data and the structured tables data.&lt;/P&gt;&lt;P&gt;I know that we can create vector search index on the structured table and use it for searching. But the problem with this approach is that I need to create a separate vector search for each table and also in vector search, the embeddings are created only on 1 column and that is used for searching via embeddings. How I can use all the columns from multiple structed tables and performed combined search via LLMs in databricks?&lt;/P&gt;&lt;P&gt;I also tried using Genie in databricks. I can perform search only on structured data and not on the text extracted form pdf files, so I could not use it.&lt;/P&gt;&lt;P&gt;I am open to use any other options also which can work in databricks.&lt;/P&gt;</description>
    <pubDate>Thu, 28 Nov 2024 12:32:20 GMT</pubDate>
    <dc:creator>narenderkumar57</dc:creator>
    <dc:date>2024-11-28T12:32:20Z</dc:date>
    <item>
      <title>How to perform combined search on structured and unstructured data in databrick using RAG or other</title>
      <link>https://community.databricks.com/t5/generative-ai/how-to-perform-combined-search-on-structured-and-unstructured/m-p/100334#M648</link>
      <description>&lt;P&gt;I created a RAG application in databricks which performs the following steps:&lt;/P&gt;&lt;P&gt;1. Extract text from PDF files&lt;/P&gt;&lt;P&gt;2. Prepare embeddings on extracted text and create vector search index&lt;/P&gt;&lt;P&gt;3. Create a LLM model and served the model which can answer question based on pdf data.&lt;/P&gt;&lt;P&gt;I followed the below URL mainly to develop the same:&lt;/P&gt;&lt;P&gt;&lt;A href="https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#" target="_blank"&gt;https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Now I have some structured data in the delta tables as well. I want to perform the combined search on the pdf extracted data and the structured tables data.&lt;/P&gt;&lt;P&gt;I know that we can create vector search index on the structured table and use it for searching. But the problem with this approach is that I need to create a separate vector search for each table and also in vector search, the embeddings are created only on 1 column and that is used for searching via embeddings. How I can use all the columns from multiple structed tables and performed combined search via LLMs in databricks?&lt;/P&gt;&lt;P&gt;I also tried using Genie in databricks. I can perform search only on structured data and not on the text extracted form pdf files, so I could not use it.&lt;/P&gt;&lt;P&gt;I am open to use any other options also which can work in databricks.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Nov 2024 12:32:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/how-to-perform-combined-search-on-structured-and-unstructured/m-p/100334#M648</guid>
      <dc:creator>narenderkumar57</dc:creator>
      <dc:date>2024-11-28T12:32:20Z</dc:date>
    </item>
    <item>
      <title>Re: How to perform combined search on structured and unstructured data in databrick using RAG or oth</title>
      <link>https://community.databricks.com/t5/generative-ai/how-to-perform-combined-search-on-structured-and-unstructured/m-p/138155#M1350</link>
      <description>&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;You can achieve combined retrieval across both PDF-extracted unstructured data and multiple columns from structured Delta tables in Databricks, but there are important considerations and available patterns to optimize this workflow for your RAG application.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Key Considerations for Combined Search&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Vector Search Index Limitation:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;Currently, Databricks Vector Search only supports embedding generation and search over a single column per index. Creating a vector index for each table (or column) is possible, but not ideal for seamless multi-column or multi-table querying.​&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Metadata &amp;amp; Hybrid Search:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;You can enrich each indexed record with metadata from other columns and use hybrid search (keyword + vector similarity). However, true semantic search across all columns via embeddings is restricted to one source column per index.​&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Unifying Data:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;The best practice is to preprocess and combine the relevant text from multiple columns (and potentially tables) into a single column for embedding purposes. This allows the embeddings to represent the joint semantics of all the fields you wish to search over.​&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Document Chunking:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;For unstructured files, chunk and enrich as needed, storing relevant metadata (e.g., PDF source, page number). Then, you can concatenate structured table fields into the same chunk format if appropriate.​&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Recommended Approach in Databricks&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Step 1: Preprocess Data&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;For structured tables, create a view or preprocessed table where values from key columns are concatenated into a single text column (e.g., using SQL’s&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;CONCAT&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;or Python string joining). Add relevant metadata columns for filtering or context.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;For PDF-extracted text, keep your existing chunks, but consider matching their schema and metadata fields to what’s used for structured data.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Step 2: Unified Delta Table&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Merge these representations into a unified Delta table or ensure both reside in similarly structured tables (i.e., both have a main text column and rich metadata attributes).​&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Step 3: Embedding and Vectorizing&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Generate embeddings for the unified "search_text" column—whether it originated from structured tables or PDF chunks.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Create a single vector index on this merged table if possible, or use separate indices with unified querying logic in your LLM chain/RAG pipeline.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Step 4: Retrieval and RAG Workflow&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;When conducting retrieval for LLM augmentation, use hybrid similarity/keyword search on the main text column and metadata filtering for context (e.g., table type, document source, etc.).&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The LLM can then answer based on returned chunks regardless of origin (structured table or PDF).&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Additional Databricks-Native Solutions&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Unstructured.io Pipeline:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;Consider using the Unstructured + Databricks integration to ingest, unify, and enrich both document and table content for downstream vector search and GenAI workloads.​&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;LangChain Integration:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;Tools like LangChain can help orchestrate retrieval from multiple sources, merging results and facilitating the augmented LLM responses.​&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Genie Limitation:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;Genie only operates with structured data, so it's not suitable for unified search that includes PDF-extracted text.​&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Example Workflow&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Use Unstructured or custom ETL to transform all relevant fields (structured and unstructured) into a shared text column, with clear metadata.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Store this in a unified Delta table.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Create vector search index on this column.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;At query time, filter or annotate responses by source (table name, document, page, etc.), allowing the LLM to provide joined or source-annotated answers.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Final Remarks&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Databricks currently does not support seamless multi-column embedding per vector index. Combining columns before embedding is the recommended way. For more sophisticated requirements, hybrid designs (metadata filtering, post-search merging in your LLM pipeline) or using orchestration tools like LangChain or Unstructured.io will offer best results.​&lt;/P&gt;</description>
      <pubDate>Fri, 07 Nov 2025 16:54:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/how-to-perform-combined-search-on-structured-and-unstructured/m-p/138155#M1350</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-11-07T16:54:15Z</dc:date>
    </item>
  </channel>
</rss>

