<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to do NLP against PDFs in Databricks? Can be done in Snowflake very easily. in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/how-to-do-nlp-against-pdfs-in-databricks-can-be-done-in/m-p/116612#M850</link>
    <description>&lt;DIV class="paragraph"&gt;To query unstructured PDF files using natural language in Databricks, you can leverage an approach similar to the "&lt;A href="https://www.databricks.com/resources/demos/tours/data-science-and-ai/databricks-dbrx-instruct-playground" target="_self"&gt;Retrieval Augmented Generation (RAG) and DBRX&lt;/A&gt;" demo. Although the specific demo you referenced (&lt;A href="https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#" target="_blank"&gt;https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#&lt;/A&gt;) processes structured data, Databricks supports workflows for unstructured PDF data using similar methodologies with adaptations for raw text extraction.&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;Here is a step-by-step outline you could follow:&lt;/DIV&gt;
&lt;OL start="1"&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Ingest and Parse PDFs&lt;/STRONG&gt;: Start by extracting text from the PDF files. For PDFs that carry an embedded text layer, a Python library such as PyPDF2 (now maintained as pypdf) is usually enough; for scanned documents, Spark OCR (available from John Snow Labs) adds optical character recognition. Spark OCR integrates directly with Apache Spark, which makes it well suited to Databricks environments.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Preprocess Extracted Content&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;Once the text from PDFs is extracted, process it into manageable chunks.&lt;/LI&gt;
&lt;LI&gt;This involves normalization, tokenization, and, if necessary, splitting the text into sections that can each be embedded.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Semantic Embedding with Vector Search&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;Use a pre-trained transformer-based language model (e.g., BERT or OpenAI embeddings) to embed the extracted sections into vector representations.&lt;/LI&gt;
&lt;LI&gt;Store these embeddings in a vector index such as Databricks Vector Search, or in an external vector database such as Milvus.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Build Query Handling Logic&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;Use LangChain or other orchestration frameworks to build a custom semantic search flow.&lt;/LI&gt;
&lt;LI&gt;This system links the questions entered by users to the closest matching chunks within the PDF-based vector index.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Answer Generation and Comparison Tasks&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;For generating direct answers or comparing KPIs across documents, integrate foundational LLMs (Large Language Models) like OpenAI’s models or those accessible via Hugging Face Transformers.&lt;/LI&gt;
&lt;LI&gt;Pass the retrieved chunks to the model as context in the prompt so that answers are grounded in your documents; fine-tuning the model on your documents is optional and usually not needed for a first demo.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Implementation Resources&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;While Databricks doesn't currently provide an identical step-by-step guide for processing unstructured PDFs using NLP, you can adapt relevant portions of the "LLM Chatbot With RAG and DBRX" demo (e.g., vector search workflow and RAG integration) for this purpose. Additionally, exploring the Oncology Text Extraction and other accelerators might provide valuable insights.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;DIV class="paragraph"&gt;If you’re looking for inspiration or community-proven workflows, the architecture described in "Oncology Data Extraction with NLP" that uses Spark OCR could serve as a starting point.&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;With access to the libraries mentioned above, this approach can be implemented in Databricks within hours, which makes Databricks a suitable platform for the task.&lt;/DIV&gt;</description>
    <pubDate>Fri, 25 Apr 2025 17:26:33 GMT</pubDate>
    <dc:creator>Louis_Frolio</dc:creator>
    <dc:date>2025-04-25T17:26:33Z</dc:date>
    <item>
      <title>How to do NLP against PDFs in Databricks? Can be done in Snowflake very easily.</title>
      <link>https://community.databricks.com/t5/generative-ai/how-to-do-nlp-against-pdfs-in-databricks-can-be-done-in/m-p/114866#M832</link>
      <description>&lt;P&gt;I’d love to build a quick "art of the possible" demo showing how easy it is to query unstructured PDFs using natural language. In Snowflake, I wired up a similar solution in ~2 hours just by following their tutorial guide.&lt;/P&gt;&lt;P&gt;Does anyone know the best way to replicate this in Databricks? Even better—does Databricks have a similar step-by-step resource for NLP on PDFs? I did use this &lt;A href="https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#" target="_self"&gt;notebook&lt;/A&gt; but it is using structured data (Databricks documents chunked and embedded). I also tried Genie route and that's also expecting structured tables. No bueno there!&lt;/P&gt;&lt;P&gt;Basically, I have bunch of PDF files, which I would like to use natural language questions against them, even ask them to compare specific KPI present in one PDF document vs another. Hoping it is not too difficult to do this in Databricks!&lt;/P&gt;&lt;P&gt;Any guidance would be greatly appreciated!&lt;/P&gt;</description>
      <pubDate>Wed, 09 Apr 2025 00:02:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/how-to-do-nlp-against-pdfs-in-databricks-can-be-done-in/m-p/114866#M832</guid>
      <dc:creator>Iris12</dc:creator>
      <dc:date>2025-04-09T00:02:53Z</dc:date>
    </item>
    <item>
      <title>Re: How to do NLP against PDFs in Databricks? Can be done in Snowflake very easily.</title>
      <link>https://community.databricks.com/t5/generative-ai/how-to-do-nlp-against-pdfs-in-databricks-can-be-done-in/m-p/116612#M850</link>
      <description>&lt;DIV class="paragraph"&gt;To query unstructured PDF files using natural language in Databricks, you can leverage an approach similar to the "&lt;A href="https://www.databricks.com/resources/demos/tours/data-science-and-ai/databricks-dbrx-instruct-playground" target="_self"&gt;Retrieval Augmented Generation (RAG) and DBRX&lt;/A&gt;" demo. Although the specific demo you referenced (&lt;A href="https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#" target="_blank"&gt;https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#&lt;/A&gt;) processes structured data, Databricks supports workflows for unstructured PDF data using similar methodologies with adaptations for raw text extraction.&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;Here is a step-by-step outline you could follow:&lt;/DIV&gt;
&lt;OL start="1"&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Ingest and Parse PDFs&lt;/STRONG&gt;: Start by extracting text from the PDF files. For PDFs that carry an embedded text layer, a Python library such as PyPDF2 (now maintained as pypdf) is usually enough; for scanned documents, Spark OCR (available from John Snow Labs) adds optical character recognition. Spark OCR integrates directly with Apache Spark, which makes it well suited to Databricks environments.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Preprocess Extracted Content&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;Once the text from PDFs is extracted, process it into manageable chunks.&lt;/LI&gt;
&lt;LI&gt;This involves normalization, tokenization, and, if necessary, splitting the text into sections that can each be embedded.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Semantic Embedding with Vector Search&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;Use a pre-trained transformer-based language model (e.g., BERT or OpenAI embeddings) to embed the extracted sections into vector representations.&lt;/LI&gt;
&lt;LI&gt;Store these embeddings in a vector index such as Databricks Vector Search, or in an external vector database such as Milvus.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Build Query Handling Logic&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;Use LangChain or other orchestration frameworks to build a custom semantic search flow.&lt;/LI&gt;
&lt;LI&gt;This system links the questions entered by users to the closest matching chunks within the PDF-based vector index.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Answer Generation and Comparison Tasks&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;For generating direct answers or comparing KPIs across documents, integrate foundational LLMs (Large Language Models) like OpenAI’s models or those accessible via Hugging Face Transformers.&lt;/LI&gt;
&lt;LI&gt;Pass the retrieved chunks to the model as context in the prompt so that answers are grounded in your documents; fine-tuning the model on your documents is optional and usually not needed for a first demo.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Implementation Resources&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;While Databricks doesn't currently provide an identical step-by-step guide for processing unstructured PDFs using NLP, you can adapt relevant portions of the "LLM Chatbot With RAG and DBRX" demo (e.g., vector search workflow and RAG integration) for this purpose. Additionally, exploring the Oncology Text Extraction and other accelerators might provide valuable insights.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/OL&gt;
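&lt;DIV class="paragraph"&gt;As a rough illustration of steps 2 through 4, the sketch below chunks already-extracted text, embeds each chunk, and returns the chunks closest to a question. It is a toy under stated assumptions, not a Databricks API: the bag-of-words embed function stands in for a real embedding model, the in-memory list stands in for a vector index, and the function names, file names, and sample text are all made up for the example.&lt;/DIV&gt;

```python
import math
import re
from collections import Counter


def chunk_text(text, size=40, overlap=10):
    """Split text into overlapping windows of words (step 2 of the outline)."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def embed(text):
    """Toy embedding: lowercase word counts.

    Assumption: a real pipeline would call an embedding model here instead.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(documents):
    """Map {file name: extracted text} to a list of (name, chunk, vector) rows."""
    index = []
    for name, text in documents.items():
        for chunk in chunk_text(text):
            index.append((name, chunk, embed(chunk)))
    return index


def search(index, question, k=2):
    """Return the k chunks most similar to the question (step 4 of the outline)."""
    q = embed(question)
    ranked = sorted(index, key=lambda row: cosine(q, row[2]), reverse=True)
    return [(name, chunk) for name, chunk, _ in ranked[:k]]


# Hypothetical sample data standing in for text extracted from two PDFs.
docs = {
    "q1_report.pdf": "Revenue for the first quarter was 10 million dollars.",
    "q2_report.pdf": "Headcount grew to 250 employees in the second quarter.",
}
index = build_index(docs)
hits = search(index, "What was the revenue?", k=1)
print(hits[0][0], "|", hits[0][1])
```

&lt;DIV class="paragraph"&gt;In a real pipeline, embed would call an embedding model, search would query a vector index, and the retrieved chunks would then be placed into the LLM prompt for answer generation or KPI comparison.&lt;/DIV&gt;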
&lt;DIV class="paragraph"&gt;If you’re looking for inspiration or community-proven workflows, the architecture described in "Oncology Data Extraction with NLP" that uses Spark OCR could serve as a starting point.&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;With access to the libraries mentioned above, this approach can be implemented in Databricks within hours, which makes Databricks a suitable platform for the task.&lt;/DIV&gt;</description>
      <pubDate>Fri, 25 Apr 2025 17:26:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/how-to-do-nlp-against-pdfs-in-databricks-can-be-done-in/m-p/116612#M850</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-04-25T17:26:33Z</dc:date>
    </item>
  </channel>
</rss>

