Technical Blog
NikkTheGreek
Databricks Employee

Summary

  • Turn unstructured product manuals into structured, queryable data using Databricks Agent Bricks AI Functions, with no custom model training or rigid templates required.
  • Build a complete document intelligence pipeline that parses PDFs, extracts structured fields, evaluates quality, and exposes results through natural-language interfaces, all on a single platform.
  • Address real extraction challenges such as inconsistent terminology across vendors, varying document formats, and differing levels of detail, using prompt engineering and declarative schemas.

The Challenge: Unlocking Product Data from Unstructured Documents

Across manufacturing, retail, healthcare, and financial services, critical technical data resides in unstructured documents. According to Gartner and other sources, around 80% of enterprise data is unstructured. Product manuals, supplier data sheets, and specification catalogs contain the information that teams need for product comparisons, sourcing decisions, and customer support. Yet this data has rarely been systematically extracted into structured databases. And while almost every enterprise values unstructured data, Forbes reports that the majority, about 7 out of 10, still find it difficult to manage.

Based on research from the European Data Protection Board (EDPB), manual processing of these documents is too slow, error-prone, and costly at today's scale. Traditional OCR, which became mainstream in the 1990s, could recognize text but required rigid templates for data extraction: if a manual changed its layout, the system would fail. Generative AI changes this equation fundamentally. Large language models understand context, layout, and structure, and can extract data from documents they have never encountered before, without requiring document-specific templates or training data. For more detailed insights, refer to G. Colakoglu et al. and C. Deng et al.

Consider a concrete scenario: a procurement team needs to compare cordless drill/drivers across four manufacturers. Each vendor provides a PDF manual in a different format. Some span 250+ pages, others are brief instruction sheets with minimal technical detail. Building a comparison spreadsheet from these documents would require a full day of manual effort, and the process would need to be repeated every time a new product enters the catalog.

For this article, we used publicly available product manuals from four manufacturers:

This is the problem we set out to solve. The same pattern extends well beyond power tools: identifying the right toner for a printer, the compatible component for an assembly line, or the correct accessory for a device. In this article, we walk through every step of building this document extraction solution on Databricks, from parsing and extraction to evaluation and end-user interfaces.

The Solution

Our solution is the recently released Intelligent Document Processing, which uses two Databricks Agent Bricks AI Functions to go from raw PDFs to a structured product catalog:

  • ai_parse_document extracts structured content from unstructured documents. It accepts binary data from supported formats (PDF, DOCX, PPTX, PNG, JPG) and returns a structured JSON output containing text, tables, figures, and layout metadata.
  • ai_extract (v2) is a complete rework of the original function and, at the time of writing, has just entered Public Preview. Built by the Databricks AI Research team, it is a scalable function optimized for extracting key information from enterprise document text. It accepts a declarative JSON schema that describes the fields, types, and descriptions you want to extract, along with optional instructions to guide the model.

The pipeline is expressed entirely as streaming tables inside a Lakeflow Spark Declarative Pipeline, leveraging Auto Loader to make it incremental and production-ready from day one.

The architecture is intentionally straightforward: no custom model training and no separate inference endpoints. The entire pipeline can be expressed in Python or SQL with simple prompting. In the following sections we describe the three steps of a pipeline executing Intelligent Document Processing.

image1.png

Figure 1: Three steps for Intelligent Document Processing

To get started quickly, you can use the Agent Bricks Information Extraction UI via the Agents tab to design your schema interactively and use it as a guide before embedding the schema as code.

image5.png

Figure 2: Executing the Intelligent Document Processing via the Agent Bricks UI

Step 1: Parse PDFs with ai_parse_document

We start by reading PDFs as binary files from a Unity Catalog Volume, which provides governed file storage with full access control. In the UI, a few examples are processed for deeper analysis using document parsing. The figure below shows how common text, tables, and even images are processed; for images, descriptions are generated that can be leveraged by other AI tools.

image7.png

Figure 3: Left: Original Product Manual of a Power Tool, Right: Processed document in the UI

In the background the ai_parse_document function processes each PDF and returns a structured JSON representation of the document, including text, tables, figures, and layout information:

from pyspark import pipelines as dp
import pyspark.sql.functions as F

# I/O variables
table_prefix = spark.conf.get("table")
input_volume_path = f"{spark.conf.get('volume')}/productmanuals"

@dp.table(
    name=f"{table_prefix}_productmanuals_parsed",
    comment="Table containing parsed product manual data from PDF files, including file metadata",
)
def productmanuals_parsed():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .load(input_volume_path)
        .withColumn(
            "parsed",
            F.ai_parse_document(
                F.col("content"),
                {"version": "2.0", "descriptionElementTypes": "*"},
            ),
        )
        .select(
            F.col("_metadata.file_path").alias("path"),
            F.col("_metadata.file_name").alias("file_name"),
            F.col("_metadata.file_size").alias("file_size"),
            F.col("parsed"),
        )
    )

By leveraging a streaming table with spark.readStream, the pipeline becomes incremental. It applies streaming semantics in a declarative way so that only new files are processed on subsequent refreshes (see documentation): when new PDFs land in the volume, only those new files get processed on the next pipeline run.
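The incremental semantics can be illustrated with a toy checkpoint sketch. This is purely illustrative: Auto Loader persists its file-discovery state durably and transactionally, unlike the in-memory set used here.

```python
# Toy illustration of incremental file processing: a checkpoint records
# which files were already processed, so each refresh only touches new
# arrivals. Auto Loader manages this state for you in production.

def refresh(all_files, checkpoint):
    """Process only files not yet recorded in the checkpoint."""
    new_files = [f for f in all_files if f not in checkpoint]
    for f in new_files:
        checkpoint.add(f)  # Auto Loader records this transactionally
    return new_files

checkpoint = set()
first_run = refresh(["a.pdf", "b.pdf"], checkpoint)          # both files
second_run = refresh(["a.pdf", "b.pdf", "c.pdf"], checkpoint)  # only c.pdf
```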

If PDFs originate in external systems such as SharePoint, Lakeflow Connect can automatically provision the files into Unity Catalog as a table with a binary content column. ai_parse_document then consumes the data directly from the generated table containing the raw binary files.

Step 2: Extract Structured Fields with ai_extract

In the UI, you can now use simple prompting to generate a schema, add instructions, and evaluate the information extraction results ad hoc.

image4.png

Figure 4: Configuring the Information Extraction in the UI

In the background, the ai_extract function takes the parsed document along with a declarative JSON schema that describes exactly which fields to extract, their data types, and guidance for how to locate them. The extraction configuration can also be generated using the UI.

import json
from pyspark import pipelines as dp
from pyspark.sql import functions as F

# I/O variables
table_prefix = spark.conf.get("table")
input_table_path = f"{table_prefix}_productmanuals_parsed"
input_column = "parsed"
output_column = "ai_result"

# Config variables
instructions = """
Extract product specifications from this power tool manual. Focus only on the English language sections. All extracted values (including product_name) must be in English. Look for technical data in specification tables, feature lists, and product descriptions throughout the entire document. If a specification is
mentioned anywhere in the document (not just in tables), extract it. If multiple models are listed, extract the primary model.
"""
schema = json.dumps(
    {
        "manufacturer": {
            "type": "string",
            "description": "Brand or manufacturer name, e.g. Bosch, Makita, BLACK+DECKER",
        },
        "model_number": {
            "type": "string",
            "description": "Product model number or identifier, e.g. GSR 18V-65, BCD382, DF033D",
        },
        "product_name": {
            "type": "string",
            "description": "Full product name or description, e.g. Cordless Drill/Driver, 20V MAX Cordless Drill",
        },
        "product_type": {
            "type": "string",
            "description": "Type of tool: drill, drill/driver, hammer drill, impact driver, etc.",
        },
        "rated_voltage_v": {"type": "number", "description": "Rated or nominal voltage in volts"},
    }
)

@dp.table(
    name=f"{table_prefix}_productmanuals_extract",
    comment="Extracted product specifications (manufacturer, model, voltage, torque, etc.) via AI from parsed product manual PDFs",
)
def productmanuals_extract():
    sql = f"""
        ai_extract(
            {input_column},
            '{schema}',
            map('version', '2.0', 'instructions', '{instructions}')
        )
    """
    return (
        spark.readStream.table(input_table_path)
        .withColumn(output_column, F.expr(sql))
        .select("path", "file_name", "file_size", output_column)
    )

The full schema defines 13 fields covering manufacturer info, performance specifications, physical dimensions, and accessory compatibility. Two design decisions shaped the schema and are worth highlighting for anyone building their own extraction pipeline:

Schema and field descriptions act as extraction hints. Specifying "Use the hard screwdriving value if both hard and soft are given" for torque, or "If given in lbs, convert to kg" for weight, steers the LLM toward the right value with simple prompt engineering. Giving more context by describing torque as "May appear as max torque, tightening torque, or fastening torque" helps the LLM find the right value even when manufacturers use different terminology for the same specification.

The instructions steer the AI function. Some of our manuals contain content in multiple languages, so we add explicit instructions to extract only from English sections. We also instructed the function to leverage the specification tables in the documents.
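The hint-style descriptions can be sketched as a schema fragment. The max_torque_nm and weight_kg fields below are hypothetical illustrations in the same style as the schema shown earlier, not verbatim parts of the 13-field schema:

```python
import json

# Hypothetical fields illustrating hint-style descriptions that embed
# disambiguation and unit-conversion guidance directly in the schema.
extra_fields = {
    "max_torque_nm": {
        "type": "number",
        "description": (
            "Maximum torque in newton-metres. May appear as max torque, "
            "tightening torque, or fastening torque. Use the hard "
            "screwdriving value if both hard and soft are given."
        ),
    },
    "weight_kg": {
        "type": "number",
        "description": "Tool weight in kilograms. If given in lbs, convert to kg.",
    },
}
schema_fragment = json.dumps(extra_fields)
```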

You can also leverage the Agent Bricks UI as guidance when creating the AI function integration.

Step 3: Flatten and Type-Cast into a Clean Table

The final step flattens the JSON result from ai_extract into strongly typed columns with descriptive column comments. These comments are important because downstream Databricks components such as Genie Spaces leverage them to generate more accurate results. To generate those transformations, you can leverage Genie Code, our autonomous AI agent in Databricks designed for data engineering, science, and analytics.

from pyspark import pipelines as dp
import pyspark.sql.functions as F

# I/O variables
table_prefix = spark.conf.get("table")
input_table_path = f"{table_prefix}_productmanuals_extract"

# Config variables
schema = """
    file_name STRING COMMENT 'Original PDF file name.',
    manufacturer STRING COMMENT 'Brand or manufacturer name.',
    model_number STRING COMMENT 'Product model number or identifier.',
    product_name STRING COMMENT 'Full product name or description.',
    product_type STRING COMMENT 'Type of tool (drill, drill/driver, hammer drill, etc.).',
    rated_voltage_v DOUBLE COMMENT 'Rated voltage in volts.',
    ...
"""

@dp.table(
    name=f"{table_prefix}_productmanuals_processed",
    comment="Processed product catalog from power tool manuals: structured specifications for cross-vendor comparison, procurement intelligence, and product recommendation.",
    schema=schema,
)
def productmanuals_processed():
    return (
        spark.readStream.table(input_table_path)
        .select(
           F.col("file_name"),
           F.expr("ai_result:response.manufacturer::STRING").alias("manufacturer"),            
           F.expr("ai_result:response.model_number::STRING").alias("model_number"),
           F.expr("ai_result:response.product_name::STRING").alias("product_name"),
           F.expr("ai_result:response.product_type::STRING").alias("product_type"),
           F.expr("ai_result:response.rated_voltage_v::DOUBLE").alias("rated_voltage_v"),
         )
    )
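The path-and-cast expressions such as ai_result:response.manufacturer::STRING can be mimicked in plain Python for a single row. The payload below assumes a simplified shape for the ai_extract result and is for illustration only:

```python
import json

# Hypothetical ai_result payload for one document (shape assumed for
# illustration; the real column is produced by ai_extract).
ai_result = json.dumps({
    "response": {
        "manufacturer": "Bosch",
        "model_number": "GSR 18V-65",
        "rated_voltage_v": "18",
    }
})

def extract_field(raw, field, cast=str):
    """Mimic `ai_result:response.<field>::TYPE`: navigate the JSON path,
    then cast, returning None for missing fields (like SQL NULL)."""
    value = json.loads(raw).get("response", {}).get(field)
    return cast(value) if value is not None else None

manufacturer = extract_field(ai_result, "manufacturer")           # "Bosch"
voltage = extract_field(ai_result, "rated_voltage_v", float)      # 18.0
```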

All three stages are expressed as streaming tables, making the pipeline incremental end to end. Running the pipeline on four manuals from Bosch, Makita, DeWalt, and Milwaukee produces a structured product catalog ready for analysis:

image9.png 

Figure 5: The extraction results table in the Databricks App, showing structured fields extracted from each PDF.

These results illustrate an important practical consideration: extraction quality correlates directly with the quality of the source document. Professional-grade data sheets from Bosch and DeWalt yield complete specifications, while consumer instruction manuals sometimes omit detailed technical data. The pipeline handles this appropriately, returning null where data is unavailable rather than generating fabricated values.

Evaluating Quality with MLflow

In the absence of ground truth labels, how do we measure whether the extraction is performing well? We use MLflow 3 GenAI evaluation with a combination of code-based scorers (checking field completeness and numeric plausibility) and LLM-as-judge scorers (verifying that all values are in English and that extracted manufacturers and model numbers are valid). These four scorers run in a single mlflow.genai.evaluate() call:

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[completeness, format_validator, english_check, extraction_quality]
)
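As a sketch of what one code-based scorer might check, here is standalone completeness logic. The real scorers are registered with MLflow; the function and field list below are hypothetical and only illustrate the idea of scoring field completeness:

```python
def completeness(outputs: dict, expected_fields: list) -> float:
    """Fraction of expected fields that were extracted (non-null)."""
    filled = sum(1 for f in expected_fields if outputs.get(f) is not None)
    return filled / len(expected_fields)

# Hypothetical subset of the extraction schema's fields.
fields = ["manufacturer", "model_number", "product_name", "rated_voltage_v"]
row = {"manufacturer": "Makita", "model_number": "DF033D",
       "product_name": None, "rated_voltage_v": 12.0}
score = completeness(row, fields)  # 3 of 4 fields present -> 0.75
```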

image6.png

Figure 6: MLflow evaluation results showing assessment results across four documents.

Results are logged to an MLflow experiment, establishing a continuous feedback loop: refine the prompt, re-run the pipeline, re-run the evaluation, and compare metrics across runs. This follows the same experiment tracking methodology that data scientists use for traditional ML development, now applied to GenAI extraction pipelines.

Delivering Results to Business Users

A structured table is only valuable if the right people can access it. We expose the extracted data through two complementary interfaces, each serving a different type of question.

image8.png

Figure 7: End-to-end architecture, from PDF ingestion through extraction to productized interfaces on Databricks One.

The pipeline that parses the documents and extracts and post-processes the required information is orchestrated by Lakeflow Jobs. The entire solution is deployed via Declarative Automation Bundles for CI/CD.

A Genie Space allows business users to ask natural-language questions about the structured product catalog without writing SQL. Questions like "Which drill has the highest max torque?" or "Compare all drills by weight and voltage" are translated into SQL and executed against the processed table. Because we invested in descriptive column comments during the processing step, Genie understands the semantics of each field and generates more accurate queries.

Not all questions can be answered from 13 extracted fields. Safety instructions, maintenance procedures, troubleshooting steps, and warranty terms live in the original documents. A Knowledge Assistant built with Agent Bricks indexes the raw PDF manuals from the same Unity Catalog Volume and provides cited, document-grounded answers to these open-ended questions. The incremental refresh is also executed by Lakeflow Jobs leveraging the Agent Bricks REST API.

A Supervisor Agent ties the Genie Space and the Knowledge Assistant together into one interface, allowing users to ask questions about both structured and unstructured data. It can be surfaced to end users through a Databricks App as the UI.

Databricks Apps enable developers to build and deploy secure data and AI applications directly on the Databricks platform, eliminating the need for separate infrastructure. Apps run on the serverless platform and integrate with key platform services, including Unity Catalog for data governance, Databricks SQL for querying data, and OAuth for authentication.

image3.png

Figure 8: The Databricks App provides a self-service interface for uploading PDFs and triggering the extraction pipeline.

image2.png

Figure 9: The Supervisor Agent responds to "Compare all drills by weight and voltage" with a structured table drawn from the extracted data.

Databricks One is used as the customer-facing entry point: both the Genie Space (for text-to-SQL on extracted data) and the Databricks App (for upload, run, data view, and chatbot) are accessed from there. Business users get one place to work with the productized solutions in a user interface leveraging Apps, Genie Spaces, or even Dashboards.

image10.png

Figure 10: Databricks One surfaces the Genie Space, Apps, and Dashboards as a single entry point for business users.

Key Takeaways

  • No training data required. Define what to extract through a declarative schema and start processing new document types immediately.
  • SQL-native, no infrastructure overhead. The entire pipeline is expressed as declarative streaming tables in SQL or Python, with no separate model endpoints or custom inference code to maintain.
  • Incremental by default. Streaming tables process only new documents on each run, making the pipeline production-ready from day one.
  • End-to-end governance. Unity Catalog governs PDFs, extracted tables, Genie Spaces, and Knowledge Assistants under one access control model.
  • Measurable quality. MLflow evaluation provides quantitative metrics to track extraction quality over time and catch regressions early.

Applying This Architecture to Other Use Cases

The architecture we presented is not limited to product manuals. The same pattern applies to any scenario where structured data needs to be extracted from unstructured documents:

  • Manufacturing: Extract component specifications from supplier data sheets for bill-of-materials automation
  • Healthcare: Pull structured fields from clinical trial reports or medical device documentation
  • Financial Services: Parse loan agreements, insurance policies, or regulatory filings
  • Retail: Build product catalogs from vendor-provided PDFs for marketplace onboarding

For organizations looking to extend this solution, here are natural next steps:

  1. Scale to thousands of documents using file-arrival triggers on Lakeflow Jobs to process new documents automatically as they arrive.
  2. Connect external sources with Lakeflow Connect to provision files from SharePoint or cloud storage into Unity Catalog Volumes automatically (see SharePoint connector).
  3. Build dashboards with Databricks AI/BI for visual product comparison charts and specification heatmaps.
  4. Fine-tune extraction continuously using the MLflow evaluation feedback loop to track quality as document volumes grow.

The complete code for this solution will soon be available as a solution accelerator in an accompanying GitHub repository. It can be deployed to any Databricks workspace using Declarative Automation Bundles, enabling organizations to begin extracting structured data from their own documents immediately.