Databricks Community

Iris12 · ‎04-08-2025

I’d love to build a quick "art of the possible" demo showing how easy it is to query unstructured PDFs using natural language. In Snowflake, I wired up a similar solution in ~2 hours just by following their tutorial guide.

Does anyone know the best way to replicate this in Databricks? Even better, does Databricks have a similar step-by-step resource for NLP on PDFs? I did use this notebook, but it uses structured data (Databricks documents chunked and embedded). I also tried the Genie route, and that also expects structured tables. No bueno there!

Basically, I have a bunch of PDF files that I would like to ask natural language questions against, and even compare a specific KPI in one PDF document with the same KPI in another. Hoping it is not too difficult to do this in Databricks!

Any guidance would be greatly appreciated!

Louis_Frolio · ‎04-25-2025

To query unstructured PDF files using natural language in Databricks, you can leverage an approach similar to the "Retrieval Augmented Generation (RAG) and DBRX" demo. Although the specific demo you referenced (https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#) processes structured data, Databricks supports workflows for unstructured PDF data using similar methodologies with adaptations for raw text extraction.

Here is a step-by-step outline you could follow:

Ingest and Parse PDFs: Start by extracting text from the PDF files. A Python library such as PyPDF2 (now maintained as pypdf) handles PDFs that have an embedded text layer, while Spark OCR (from John Snow Labs) covers scanned documents. Spark OCR integrates directly with Apache Spark, which makes it well suited to Databricks environments.
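A minimal sketch of the extraction step, assuming text-based (not scanned) PDFs and the `pypdf` package. The function names and the record layout (`source`, `page`, `text`) are illustrative choices, not taken from any Databricks demo; scanned PDFs would need an OCR step instead.

```python
from pathlib import Path


def extract_pages(pdf_path):
    """Return one text string per page of a text-based PDF."""
    # Imported lazily so the rest of the module loads without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() can come back empty for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def pages_to_records(source, pages):
    """Attach the file name and 1-based page number to each non-empty page."""
    return [
        {"source": Path(source).name, "page": number, "text": text.strip()}
        for number, text in enumerate(pages, start=1)
        if text.strip()
    ]
```

Keeping the file name and page number on every record is what later lets an answer cite which PDF a KPI came from.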
Preprocess Extracted Content:
- Once the text from PDFs is extracted, process it into manageable chunks.
- This involves normalization, tokenization, and, if necessary, splitting the text into sections small enough to embed.
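One possible way to do the chunking, sketched in plain Python: whitespace is normalized and the text is cut into overlapping word windows. The window size and overlap are arbitrary illustrative defaults, and splitting on words rather than model tokens is a simplification.

```python
import re


def normalize(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_words(text, max_words=200, overlap=40):
    """Split text into windows of max_words words that share `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    cleaned = normalize(text)
    words = cleaned.split(" ") if cleaned else []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, which tends to help retrieval.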
Semantic Embedding with Vector Search:
- Use a pre-trained transformer-based language model (e.g., BERT or OpenAI embeddings) to embed the extracted sections into vector representations.
- Store these embeddings in a vector index such as Databricks Vector Search, or in an external vector database like Milvus.
Build Query Handling Logic:
- Use LangChain or other orchestration frameworks to build a custom semantic search flow.
- This system links the questions entered by users to the closest matching chunks within the PDF-based vector index.
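To make the embed, store, and retrieve steps concrete, here is a toy in-memory version. `toy_embed` is a stand-in (plain word counts) for a real embedding model, and the list scan is a stand-in for a vector index; both are assumptions for illustration only, and a real setup would swap in an embedding endpoint and a managed index.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: lowercase word counts, NOT a real language model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question, chunks, k=3):
    """Return up to k chunk records most similar to the question, best first."""
    q = toy_embed(question)
    scored = [(cosine(q, toy_embed(c["text"])), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop chunks that share no words with the question at all.
    return [c for score, c in scored[:k] if score > 0]
```

Each chunk record is assumed to be a dict with at least a `text` key, plus whatever metadata (such as `source`) you want returned with the hit.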
Answer Generation and Comparison Tasks:
- For generating direct answers or comparing KPIs across documents, integrate foundation LLMs (Large Language Models) such as OpenAI’s models or those accessible via Hugging Face Transformers.
- Ground the model in your documents by passing the retrieved chunks as context in the prompt; fine-tune only if prompting alone does not give detailed enough answers.
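For the KPI comparison case, one approach is to group the retrieved chunks by source document before handing them to the LLM, so the model sees each PDF's figures side by side. This is only a sketch: the prompt wording and the `source`/`text` record keys are assumptions, and the call to the actual LLM is left out.

```python
from collections import defaultdict


def build_comparison_prompt(question, retrieved):
    """Group retrieved chunks by source document and wrap them in a prompt."""
    by_source = defaultdict(list)
    for chunk in retrieved:
        by_source[chunk["source"]].append(chunk["text"])

    sections = []
    for source in sorted(by_source):
        excerpts = "\n".join(f"- {text}" for text in by_source[source])
        sections.append(f"Document: {source}\n{excerpts}")

    context = "\n\n".join(sections)
    return (
        "Answer using only the excerpts below. "
        "If a value is missing from a document, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

The returned string is what you would send as the user message to whichever chat model you pick.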
Implementation Resources:
- While Databricks doesn't currently provide an identical step-by-step guide for processing unstructured PDFs using NLP, you can adapt relevant portions of the "LLM Chatbot With RAG and DBRX" demo (e.g., vector search workflow and RAG integration) for this purpose. Additionally, exploring the Oncology Text Extraction and other accelerators might provide valuable insights.

If you’re looking for inspiration or community-proven workflows, the architecture described in "Oncology Data Extraction with NLP" that uses Spark OCR could serve as a starting point.

This approach can be implemented in Databricks within hours, given access to the libraries mentioned above, which makes Databricks a suitable platform for the task.



How to do NLP against PDFs in Databricks? Can be done in Snowflake very easily.
