<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>article Intelligent Document Processing for Data Extraction: Transforming Product Manuals into Insights in Technical Blog</title>
    <link>https://community.databricks.com/t5/technical-blog/intelligent-document-processing-for-data-extraction-transforming/ba-p/153847</link>
    <description>&lt;H2 id="h.nre4qsdm11" class="c25"&gt;&lt;SPAN class="c29 c5"&gt;Summary&lt;/SPAN&gt;&lt;/H2&gt;
&lt;UL&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;STRONG&gt;Turn unstructured product manuals into structured, queryable data&lt;/STRONG&gt;&lt;SPAN&gt; using Databricks AI Functions, with no custom model training or rigid templates required.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;STRONG&gt;Build a complete document intelligence pipeline&lt;/STRONG&gt;&lt;SPAN&gt; that parses PDFs, extracts structured fields, evaluates quality, and exposes results through natural-language interfaces, all on a single platform.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;STRONG&gt;Address real extraction challenges&lt;/STRONG&gt;&lt;SPAN&gt; such as inconsistent terminology across vendors, varying document formats, and differing levels of detail, using prompt engineering and declarative schemas.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 id="h.nre4qsdm11" class="c25"&gt;&lt;SPAN class="c29 c5"&gt;The Challenge: Unlocking Product Data from Unstructured Documents&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;Across manufacturing, retail, healthcare, and financial services, critical technical data resides in unstructured documents. &lt;/SPAN&gt;&lt;SPAN&gt;According to &lt;/SPAN&gt;&lt;SPAN class="c52"&gt;&lt;A class="c8" href="https://researchworld.com/articles/possibilities-and-limitations-of-unstructured-data" target="_blank" rel="noopener"&gt;Gartner&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN&gt;, around 80% of enterprise data is unstructured. &lt;/SPAN&gt;&lt;SPAN class="c7"&gt;Product manuals, supplier data sheets, and specification catalogs contain the information that teams need for product comparisons, sourcing decisions, and customer support. Yet this data has rarely been systematically extracted into structured databases. &lt;/SPAN&gt;&lt;SPAN class="c11 c7"&gt;According to &lt;/SPAN&gt;&lt;SPAN class="c18 c7 c44"&gt;&lt;A class="c8" href="https://www.forbes.com/sites/forbestechcouncil/2017/06/05/the-big-unstructured-data-problem/" target="_blank" rel="noopener"&gt;Forbes&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c7 c11"&gt;, almost every enterprise values unstructured data, yet about 7 out of 10 still find it difficult to manage.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;Manual processing of these documents is too slow, error-prone, and costly at today's scale. According to research from the &lt;/SPAN&gt;&lt;SPAN class="c18 c7"&gt;&lt;A class="c8" href="https://www.edpb.europa.eu/system/files/2024-06/ai-risks_d2optical-character-recognition_edpb-spe-programme_en_2.pdf" target="_blank" rel="noopener"&gt;European Data Protection Board (EDPB)&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;, traditional OCR can recognize text, but structured data extraction has typically required rigid templates, and accuracy drops when a document deviates from the expected structure. Generative AI changes this equation fundamentally. Large language models understand context, layout, and structure. They can extract data from documents they have never encountered before, without requiring document-specific templates or training data. For more detailed insights, refer to &lt;/SPAN&gt;&lt;SPAN class="c18 c7"&gt;&lt;A class="c8" href="https://aclanthology.org/2025.findings-emnlp.973.pdf" target="_blank" rel="noopener"&gt;G. Colakoglu et al.&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;and &lt;/SPAN&gt;&lt;SPAN class="c18 c7"&gt;&lt;A class="c8" href="https://arxiv.org/html/2412.18424v3" target="_blank" rel="noopener"&gt;C. Deng et al.&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c5 c12 c7"&gt;Consider a concrete scenario: a procurement team needs to compare cordless drill/drivers across four manufacturers. Each vendor provides a PDF manual in a different format. Some span 250+ pages, others are brief instruction sheets with minimal technical detail. Building a comparison spreadsheet from these documents would require a full day of manual effort, and the process would need to be repeated every time a new product enters the catalog.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c5 c12 c7"&gt;This is the problem we set out to solve. The same pattern extends well beyond power tools: identifying the right toner for a printer, the compatible component for an assembly line, or the correct accessory for a device. In this article, we walk through every step of building this document extraction solution on Databricks, from parsing and extraction to evaluation and end-user interfaces.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c5 c12 c7"&gt;For this article, we used publicly available product manuals from four cordless drill manufacturers as part of a &lt;A href="https://github.com/databricks-solutions/databricks-blogposts/tree/main/2026-04-Intelligent-document-processing" target="_self"&gt;Databricks Solution Accelerator available on GitHub&lt;/A&gt;.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H2 id="h.j49n2ny8jo27" class="c25"&gt;&lt;SPAN class="c49"&gt;The Solution&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;We built our solution using Databricks' &lt;/SPAN&gt;&lt;SPAN class="c18 c7"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/generative-ai/agent-bricks/intelligent-document-processing" target="_blank" rel="noopener"&gt;Intelligent Document Processing&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c7"&gt; capabilities, &lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;leveraging two AI Functions to transform raw PDFs into a structured product catalog:&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL class="c13 lst-kix_63icp56h8wzj-0 start"&gt;
&lt;LI class="c14 li-bullet-0"&gt;&lt;SPAN class="c24 c7 c37"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document" target="_blank" rel="noopener"&gt;ai_parse_document&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;parses structured content from unstructured documents in supported formats (PDF, DOCX, PPTX, PNG, JPG, TIFF).&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI class="c14 li-bullet-0"&gt;&lt;SPAN class="c18 c7"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_extract" target="_blank" rel="noopener"&gt;ai_extract&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;(v2) extracts structured data from text and documents according to a provided schema.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;The pipeline is expressed entirely as streaming tables inside a &lt;/SPAN&gt;&lt;SPAN class="c18 c6"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/ldp/" target="_blank" rel="noopener"&gt;Lakeflow Spark Declarative Pipeline&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;, which makes it incremental and production-ready from day one. It uses &lt;/SPAN&gt;&lt;SPAN class="c18 c7"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/" target="_blank" rel="noopener"&gt;Auto Loader&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;for efficient file ingestion.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;The architecture is intentionally straightforward: no custom model training and no separate inference endpoints. The entire pipeline can be expressed in Python or SQL, with extraction guided by simple prompting. &lt;/SPAN&gt;&lt;SPAN class="c7"&gt;Below, we describe the three steps of an Intelligent Document Processing pipeline.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image1.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/26236i110C8B3075C4D273/image-size/large?v=v2&amp;amp;px=999" role="button" title="image1.png" alt="image1.png" /&gt;&lt;/span&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c12 c17 c7"&gt;Figure 1: Three steps for Intelligent Document Processing&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c43"&gt;&lt;SPAN class="c7"&gt;To get started quickly, the&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;Agent Bricks Information Extraction UI, available from the Agents tab, can be used to design the schema interactively before converting it to code.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c43"&gt;&lt;SPAN&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image5.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/26238i4701BBA4939F0FFA/image-size/large?v=v2&amp;amp;px=999" role="button" title="image5.png" alt="image5.png" /&gt;&lt;/span&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c12 c17 c7"&gt;Figure 2: Executing the Intelligent Document Processing via the Agent Bricks UI&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3 id="h.ot89fn269twf" class="c38"&gt;&lt;SPAN class="c36"&gt;Step 1: Parse PDFs with ai_parse_document&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;We start by reading PDFs as binary files from a &lt;/SPAN&gt;&lt;SPAN class="c6"&gt;Unity Catalog Volume&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;, which provides governed file storage. In the UI, a few examples are processed for deeper analysis using the document parsing capability. The figure below shows how the pipeline processes text, tables, and images, generating descriptions for the figures shown, which can be used by downstream AI tools.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image7.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/26239i2F2A90BA27255088/image-size/large?v=v2&amp;amp;px=999" role="button" title="image7.png" alt="image7.png" /&gt;&lt;/span&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7 c45"&gt;Figure 3: Left: Original Product Manual of a Power Tool, Right: Processed document in the UI&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;In the background, the &lt;/SPAN&gt;&lt;SPAN class="c19 c7"&gt;ai_parse_document&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;function processes each PDF and returns a structured representation of the document, including text, tables, figures, and layout information:&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from pyspark import pipelines as dp
import pyspark.sql.functions as F

# I/O variables
table_prefix = spark.conf.get("table")
input_volume_path = f"{spark.conf.get('volume')}/productmanuals"

@dp.table(
    name=f"{table_prefix}_productmanuals_parsed",
    comment="Table containing parsed product manual data from PDF files, including file metadata",
)
def productmanuals_parsed():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .load(input_volume_path)
        .withColumn(
            "parsed",
            F.ai_parse_document(
                F.col("content"),
                {"version": "2.0", "descriptionElementTypes": "*"},
            ),
        )
        .select(
            F.col("_metadata.file_path").alias("path"),
            F.col("_metadata.file_name").alias("file_name"),
            F.col("_metadata.file_size").alias("file_size"),
            F.col("parsed"),
        )
    )
&lt;/LI-CODE&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;By leveraging a&lt;/SPAN&gt;&lt;SPAN class="c19 c7"&gt;&amp;nbsp;Streaming Table&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;with &lt;/SPAN&gt;&lt;SPAN class="c19 c7"&gt;spark.readStream&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;, the pipeline becomes incremental. It applies streaming semantics in a declarative way, meaning only new PDFs landing in the volume get processed on the next pipeline run.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;If PDFs originate in external systems such as SharePoint, &lt;/SPAN&gt;&lt;SPAN class="c18 c7"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/sharepoint-source-setup-overview" target="_blank" rel="noopener"&gt;Lakeflow Connect&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;can ingest files into Unity Catalog, which can then be processed with &lt;/SPAN&gt;&lt;SPAN class="c19 c7"&gt;ai_parse_document&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3 id="h.mvboabgkur6j" class="c38"&gt;&lt;SPAN class="c36"&gt;Step 2: Extract Structured Fields with ai_extract&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P class="c3"&gt;&lt;SPAN&gt;In the UI, a schema can be generated using simple prompting, with options to add instructions and evaluate the information extraction results ad hoc.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image4.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/26241i7F1174D45DAF9211/image-size/large?v=v2&amp;amp;px=999" role="button" title="image4.png" alt="image4.png" /&gt;&lt;/span&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c45 c7"&gt;Figure 4: Configuring the Information Extraction in the UI&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;In the background, the &lt;/SPAN&gt;&lt;SPAN class="c19 c7"&gt;ai_extract&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;function takes the parsed document along with a declarative JSON schema that describes exactly which fields to extract, their data types, and guidance for how to locate them. The schema can also be configured using the UI.&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import json
from pyspark import pipelines as dp
from pyspark.sql import functions as F

# I/O variables
table_prefix = spark.conf.get("table")
input_table_path = f"{table_prefix}_productmanuals_parsed"
input_column = "parsed"
output_column = "ai_result"

# Config variables
instructions = """
Extract product specifications from this power tool manual. Focus only on the English language sections. All extracted values (including product_name) must be in English. Look for technical data in specification tables, feature lists, and product descriptions throughout the entire document. If a specification is
mentioned anywhere in the document (not just in tables), extract it. If multiple models are listed, extract the primary model.
"""
schema = json.dumps(
    {
        "manufacturer": {
            "type": "string",
            "description": "Brand or manufacturer name, e.g. Bosch, Makita, BLACK+DECKER",
        },
        "model_number": {
            "type": "string",
            "description": "Product model number or identifier, e.g. GSR 18V-65, BCD382, DF033D",
        },
        "product_name": {
            "type": "string",
            "description": "Full product name or description, e.g. Cordless Drill/Driver, 20V MAX Cordless Drill",
        },
        "product_type": {
            "type": "string",
            "description": "Type of tool: drill, drill/driver, hammer drill, impact driver, etc.",
        },
        "rated_voltage_v": {"type": "number", "description": "Rated or nominal voltage in volts"},
    }
)
@dp.table(
    name=f"{table_prefix}_productmanuals_extract",
    comment="Extracted product specifications (manufacturer, model, voltage, torque, etc.) via AI from parsed product manual PDFs",
)
def productmanuals_extract():
    sql = f"""
        ai_extract(
            {input_column},
            '{schema}',
            map('version', '2.0', 'instructions', '{instructions}')
        )
    """
    return (
        spark.readStream.table(input_table_path)
        .withColumn(output_column, F.expr(sql))
        .select("path", "file_name", "file_size", output_column)
    )
&lt;/LI-CODE&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;The full schema defines fields covering manufacturer info, performance specifications, physical dimensions, and accessory compatibility. Two design decisions shaped the schema and are worth highlighting for anyone building their own extraction pipeline:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c6"&gt;Schema and field descriptions act as extraction hints.&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;Specifying "Use the hard screwdriving value if both hard and soft are given" for torque, or "If given in lbs, convert to kg" for weight, steers the LLM toward the right value with simple &lt;/SPAN&gt;&lt;SPAN class="c7"&gt;prompt engineering&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;. Giving more context by describing torque as "May appear as max torque, tightening torque, or fastening torque" helps the LLM find the right value even when manufacturers use different terminology for the same specification.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c6"&gt;The prompt instructs the AI function.&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;In our case, some manuals contain content in multiple languages, so we added explicit instructions to extract only from the English sections. We also instructed it to look for specifications throughout the entire document: in specification tables, but also in feature lists and product descriptions.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c43"&gt;&lt;SPAN class="c7"&gt;The Agent Bricks UI can also serve as a guide when creating the ai_extract integration.&lt;/SPAN&gt;&lt;/P&gt;
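&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;As a minimal sketch of how such hints could be written, the fragment below places the description strings quoted above into two schema entries. The field names max_torque_nm and weight_kg, and the units in the first sentence of each description, are assumptions for illustration; the full schema in the accelerator may name and word them differently.&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import json

# Hypothetical field names for illustration only; the hint sentences
# reuse the wording quoted in this article.
hinted_fields = {
    "max_torque_nm": {
        "type": "number",
        "description": (
            "Maximum torque in Nm. May appear as max torque, tightening torque, "
            "or fastening torque. Use the hard screwdriving value if both hard "
            "and soft are given."
        ),
    },
    "weight_kg": {
        "type": "number",
        "description": "Tool weight in kg. If given in lbs, convert to kg.",
    },
}

# Serialize the fragment the same way as the schema shown earlier.
schema_fragment = json.dumps(hinted_fields)
print(schema_fragment)
&lt;/LI-CODE&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;Entries like these would sit alongside the other fields in the schema dictionary, so that each hint travels with the field it describes rather than in the general instructions.&lt;/SPAN&gt;&lt;/P&gt;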
&lt;H3 id="h.bfyn4stkwp4r" class="c38"&gt;&lt;SPAN class="c36"&gt;Step 3: Flatten and Type-Cast into a Clean Table&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;The final step flattens the result from &lt;/SPAN&gt;&lt;SPAN class="c19 c7"&gt;ai_extract&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;into strongly typed columns with descriptive column comments. These comments are important because downstream Databricks components such as &lt;A class="c8" href="https://docs.databricks.com/aws/en/genie/index.html" target="_blank" rel="noopener"&gt;Genie Space&lt;/A&gt; leverage them to generate more accurate results. Those transformations can be generated using &lt;/SPAN&gt;&lt;SPAN class="c18 c7"&gt;&lt;A class="c8" href="https://www.databricks.com/blog/introducing-genie-code" target="_blank" rel="noopener"&gt;Genie Code&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;, an autonomous AI agent in Databricks designed for data engineering, science, and analytics&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;.&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from pyspark import pipelines as dp
import pyspark.sql.functions as F

# I/O variables
table_prefix = spark.conf.get("table")
input_table_path = f"{table_prefix}_productmanuals_extract"

# Config variables
schema = """
    file_name STRING COMMENT 'Original PDF file name.',
    manufacturer STRING COMMENT 'Brand or manufacturer name.',
    model_number STRING COMMENT 'Product model number or identifier.',
    product_name STRING COMMENT 'Full product name or description.',
    product_type STRING COMMENT 'Type of tool (drill, drill/driver, hammer drill, etc.).',
    rated_voltage_v DOUBLE COMMENT 'Rated voltage in volts.',
    ...
"""

@dp.table(
    name=f"{table_prefix}_productmanuals_processed",
    comment="Processed product catalog from power tool manuals: structured specifications for cross-vendor comparison, procurement intelligence, and product recommendation.",
    schema=schema,
)
def productmanuals_processed():
    return (
        spark.readStream.table(input_table_path)
        .select(
            F.col("file_name"),
            F.expr("ai_result:response.manufacturer::STRING").alias("manufacturer"),
            F.expr("ai_result:response.model_number::STRING").alias("model_number"),
            F.expr("ai_result:response.product_name::STRING").alias("product_name"),
            F.expr("ai_result:response.product_type::STRING").alias("product_type"),
            F.expr("ai_result:response.rated_voltage_v::DOUBLE").alias("rated_voltage_v"),
        )
    )
&lt;/LI-CODE&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;All three stages are expressed as streaming tables, making the pipeline incremental end to end.&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;Running the pipeline on the four product manuals produces a structured product catalog ready for analysis:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image9.png" style="width: 874px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/26242i6E277736B8DE2A36/image-size/large?v=v2&amp;amp;px=999" role="button" title="image9.png" alt="image9.png" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c12 c17 c7"&gt;Figure 5: The extraction results table in the Databricks App, showing structured fields extracted from each PDF.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c5 c12 c7"&gt;These results illustrate an important practical consideration: extraction quality correlates directly with the quality of the source document. Professional-grade data sheets yield complete specifications, while consumer instruction manuals sometimes omit detailed technical data. The pipeline handles this appropriately, returning NULL where data is unavailable rather than generating fabricated values.&lt;/SPAN&gt;&lt;/P&gt;
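&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;To make the NULL behavior concrete, here is a plain-Python stand-in for the flattening step that can be run locally without Spark. The helper flatten_response is hypothetical and not part of any Databricks API. The sample records are made-up illustrations rather than pipeline output, and mapping uncastable values to None is a simplification that may differ from Spark's cast behavior depending on its settings.&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Hypothetical local stand-in for the ai_result:response.field::TYPE
# expressions used in Step 3; it only mimics the flatten-and-cast idea.
def flatten_response(ai_result, casts):
    """Return one flat row; missing or uncastable fields become None."""
    response = (ai_result or {}).get("response") or {}
    row = {}
    for field, cast in casts.items():
        value = response.get(field)
        try:
            row[field] = None if value is None else cast(value)
        except (TypeError, ValueError):
            row[field] = None  # simplification: uncastable values become None
    return row


casts = {"manufacturer": str, "model_number": str, "rated_voltage_v": float}

# Made-up records: one detailed data sheet, one sparse instruction sheet.
detailed = {
    "response": {
        "manufacturer": "Bosch",
        "model_number": "GSR 18V-65",
        "rated_voltage_v": "18",
    }
}
sparse = {"response": {"manufacturer": "Makita"}}

print(flatten_response(detailed, casts))
print(flatten_response(sparse, casts))
&lt;/LI-CODE&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;The sparse record keeps its manufacturer and leaves the other columns empty, which mirrors how a brief instruction sheet shows up in the processed table.&lt;/SPAN&gt;&lt;/P&gt;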
&lt;H3 id="h.qhcuq1dkvogv" class="c38"&gt;&lt;SPAN class="c33 c5"&gt;Evaluating Quality with MLflow&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;In the absence of ground truth labels, how do we measure whether the extraction is performing well? We use &lt;/SPAN&gt;&lt;SPAN class="c18 c6"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/" target="_blank" rel="noopener"&gt;MLflow 3 GenAI evaluation&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;with a combination of code-based scorers (checking field completeness and numeric plausibility) and LLM-as-judge scorers (verifying that all values are in English and that extracted manufacturers and model numbers are valid). These four scorers run in a single &lt;/SPAN&gt;&lt;SPAN class="c7 c19"&gt;mlflow.genai.evaluate()&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;call:&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import mlflow

# eval_data and the four scorers are defined elsewhere (not shown here)
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[completeness, format_validator, english_check, extraction_quality],
)
&lt;/LI-CODE&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c35"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image6.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/26243iEE0745203E5125A4/image-size/large?v=v2&amp;amp;px=999" role="button" title="image6.png" alt="image6.png" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c12 c17 c7"&gt;Figure 6: MLflow evaluation results across four documents.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;Results are logged to an MLflow experiment, establishing a continuous feedback loop: refine the prompt, re-run the pipeline, re-run the evaluation, and compare metrics across runs. This follows the same experiment tracking methodology that data scientists use for traditional ML development, now applied to GenAI extraction pipelines.&lt;/SPAN&gt;&lt;/P&gt;
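&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;The code-based checks can be very small. The sketch below shows one possible shape for a field completeness score and a numeric plausibility check as plain Python functions. The function bodies, the list of required fields, and the 60 V upper bound are our assumptions for illustration, not the accelerator's implementation. Registering such functions as MLflow scorers is not shown here and should follow the MLflow documentation.&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Assumed list of required fields, taken from the schema excerpt above.
REQUIRED_FIELDS = [
    "manufacturer",
    "model_number",
    "product_name",
    "product_type",
    "rated_voltage_v",
]


def completeness(record):
    """Share of required fields that are present and non-empty (0.0 to 1.0)."""
    filled = sum(1 for name in REQUIRED_FIELDS if record.get(name) not in (None, ""))
    return filled / len(REQUIRED_FIELDS)


def voltage_plausible(record, max_volts=60.0):
    """True if the voltage is missing or a positive number up to max_volts.

    The 60 V bound is an illustrative assumption for cordless tools.
    """
    volts = record.get("rated_voltage_v")
    if volts is None:
        return True  # a missing value is a completeness issue, not a plausibility one
    return isinstance(volts, (int, float)) and volts > 0 and max_volts >= volts


# Made-up records for illustration.
complete = {
    "manufacturer": "Bosch",
    "model_number": "GSR 18V-65",
    "product_name": "Cordless Drill/Driver",
    "product_type": "drill/driver",
    "rated_voltage_v": 18.0,
}
partial = {"manufacturer": "Makita", "model_number": "DF033D"}

print(completeness(complete), voltage_plausible(complete))
print(completeness(partial), voltage_plausible(partial))
&lt;/LI-CODE&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;Keeping these checks deterministic makes them cheap to run on every pipeline refresh, while the LLM-as-judge scorers cover the checks that need language understanding.&lt;/SPAN&gt;&lt;/P&gt;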
&lt;H2 id="h.odq09thy68yz" class="c25"&gt;&lt;SPAN class="c29 c5"&gt;Delivering Results to Business Users&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c5 c12 c7"&gt;A structured table is only valuable if the right people can access it. We expose the extracted data through two complementary interfaces, each serving a different type of question.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Data Extraction (6).png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/26778iFAE43AE6B9695023/image-size/large?v=v2&amp;amp;px=999" role="button" title="Data Extraction (6).png" alt="Data Extraction (6).png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c12 c17 c7"&gt;Figure 7: End-to-end architecture, from PDF ingestion through extraction to productized interfaces on Genie (previously known as Databricks One).&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;The pipeline that parses, extracts, and post-processes the documents is orchestrated by &lt;/SPAN&gt;&lt;SPAN class="c18 c6"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/jobs/" target="_blank" rel="noopener"&gt;Lakeflow Jobs&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;. The entire solution is&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;deployed via &lt;/SPAN&gt;&lt;SPAN class="c18 c6"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/dev-tools/bundles/" target="_blank" rel="noopener"&gt;Declarative Automation Bundles&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;for CI/CD&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;A &lt;/SPAN&gt;&lt;SPAN class="c24 c6"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/genie/index.html" target="_blank" rel="noopener"&gt;Genie Space&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;allows business users to ask natural-language questions about the structured product catalog without writing SQL. Questions like "Which drill has the highest max torque?" or "Compare all drills by weight and voltage" are translated into SQL and executed against the processed table. Because we invested in descriptive column comments during the processing step, Genie Space understands the semantics of each field and generates more accurate queries.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;Not all questions can be answered from the extracted fields. Safety instructions, maintenance procedures, troubleshooting steps, and warranty terms live in the original documents. A &lt;/SPAN&gt;&lt;SPAN class="c24 c6"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/generative-ai/agent-bricks/knowledge-assistant" target="_blank" rel="noopener"&gt;Knowledge Assistant&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;built with Agent Bricks indexes the raw PDF manuals from the same Unity Catalog Volume and provides cited, document-grounded answers to these open-ended questions. The incremental refresh is also executed by &lt;/SPAN&gt;&lt;SPAN class="c18 c6"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/jobs/" target="_blank" rel="noopener"&gt;Lakeflow Jobs&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;, leveraging the &lt;/SPAN&gt;&lt;SPAN class="c18 c7"&gt;&lt;A class="c8" href="https://docs.databricks.com/api/workspace/knowledgeassistants" target="_blank" rel="noopener"&gt;Agent Bricks REST API&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;A &lt;/SPAN&gt;&lt;SPAN class="c24 c6"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/generative-ai/agent-bricks/multi-agent-supervisor" target="_blank" rel="noopener"&gt;Supervisor Agent&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;ties the Genie Space and Knowledge Assistant into one interface, allowing users to ask questions about structured and unstructured data. It can be surfaced to end users using &lt;/SPAN&gt;&lt;SPAN class="c18 c6"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/dev-tools/databricks-apps/" target="_blank" rel="noopener"&gt;Databricks Apps&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;as the user interface.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c48"&gt;&lt;SPAN class="c18"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/dev-tools/databricks-apps/" target="_blank" rel="noopener"&gt;Databricks Apps&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;enables developers to build and deploy secure data and AI applications directly on the Databricks platform, which eliminates the need for separate infrastructure. Apps run on the serverless platform and integrate with key platform services, including Unity Catalog for data governance, Databricks SQL for querying data, and OAuth for authentication. &lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image3.png" style="width: 902px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/26246iED1DED7FC365BA86/image-size/large?v=v2&amp;amp;px=999" role="button" title="image3.png" alt="image3.png" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="c12 c17 c7"&gt;Figure 8: Databricks Apps provides a self-service interface for uploading PDFs and triggering the extraction pipeline.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image2.png" style="width: 891px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/26247iA1E439CADBA8969D/image-size/large?v=v2&amp;amp;px=999" role="button" title="image2.png" alt="image2.png" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="c12 c7 c17"&gt;Figure 9: The Supervisor Agent responds to "Compare all drills by weight and voltage" with a structured table drawn from the extracted data.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c48"&gt;&lt;SPAN class="c18"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/workspace/genie" target="_blank" rel="noopener"&gt;Genie (previously known as Databricks One)&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;is a user interface designed for business users, giving them a single, intuitive entry point to interact with data and AI in Databricks. From there, users can ask data questions in natural language via Genie Spaces, interact with custom-built Databricks Apps that combine analytics, AI, and workflows, and view AI/BI dashboards to track KPIs and analyze metrics.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image10.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/26248iE0E18C53B9B432D7/image-size/large?v=v2&amp;amp;px=999" role="button" title="image10.png" alt="image10.png" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="c12 c17 c7"&gt;Figure 10: Genie surfaces Genie Spaces, Apps, and dashboards as a single entry point for business users.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H2 id="h.bixn08dc9br2" class="c25"&gt;&lt;SPAN class="c5 c29"&gt;Key Takeaways&lt;/SPAN&gt;&lt;/H2&gt;
&lt;UL class="c13 lst-kix_n4ese0xvb2ra-0 start"&gt;
&lt;LI class="c14 li-bullet-0"&gt;&lt;SPAN class="c6"&gt;No training data required.&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;Define what to extract through a declarative schema and start processing new document types immediately.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI class="c14 li-bullet-0"&gt;&lt;SPAN class="c6"&gt;Incremental by default.&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;Streaming tables process only new documents on each run, so previously processed files are not parsed or extracted again as the document set grows.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI class="c14 li-bullet-0"&gt;&lt;SPAN class="c6"&gt;End-to-end governance.&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;Unity Catalog governs PDFs, extracted tables, Genie Spaces, and Knowledge Assistants under one access control model.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI class="c14 li-bullet-0"&gt;&lt;SPAN class="c6"&gt;Measurable quality.&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;MLflow evaluation provides quantitative metrics to track extraction quality over time and continuously improve results.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
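&lt;P class="c0"&gt;&lt;SPAN class="c5 c12 c7"&gt;To make the first and last takeaways concrete, here is a minimal, platform-independent sketch in plain Python. It is not the pipeline's actual code: the schema, the field names (model, weight_kg, voltage_v), the helper functions, and the sample records are all hypothetical. It only illustrates how a declarative schema can drive both value coercion and a per-field accuracy score against hand-labelled ground truth.&lt;/SPAN&gt;&lt;/P&gt;

```python
# Hypothetical declarative schema: field name mapped to the expected Python type.
# The field names are illustrative and not taken from the article's pipeline.
DRILL_SCHEMA = {
    "model": str,
    "weight_kg": float,
    "voltage_v": float,
}


def coerce_record(raw, schema):
    """Coerce one extracted record to the schema; missing or invalid values become None."""
    clean = {}
    for field, expected_type in schema.items():
        value = raw.get(field)
        try:
            clean[field] = expected_type(value) if value is not None else None
        except (TypeError, ValueError):
            clean[field] = None
    return clean


def field_accuracy(extracted, expected, schema):
    """Return the fraction of exact matches per field across paired records."""
    scores = {}
    for field in schema:
        matches = sum(
            1 for got, want in zip(extracted, expected) if got[field] == want[field]
        )
        scores[field] = matches / len(expected)
    return scores


# Made-up sample data standing in for model output and hand-labelled ground truth.
raw_extractions = [
    {"model": "DX-100", "weight_kg": "1.5", "voltage_v": "18"},
    {"model": "DX-200", "weight_kg": "n/a", "voltage_v": "20"},
]
ground_truth = [
    {"model": "DX-100", "weight_kg": 1.5, "voltage_v": 18.0},
    {"model": "DX-200", "weight_kg": 2.1, "voltage_v": 20.0},
]

extracted = [coerce_record(r, DRILL_SCHEMA) for r in raw_extractions]
accuracy = field_accuracy(extracted, ground_truth, DRILL_SCHEMA)
print(accuracy)
```

&lt;P class="c0"&gt;&lt;SPAN class="c5 c12 c7"&gt;Per-field scores like these could be logged on each run, for example with mlflow.log_metric, so that a drop in one field's accuracy is visible over time. Exact-match comparison is a deliberate simplification; numeric fields may need a tolerance in practice.&lt;/SPAN&gt;&lt;/P&gt;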
&lt;H2 id="h.rg1rde4kfujp" class="c25"&gt;&lt;SPAN class="c29 c5"&gt;Applying This Architecture to Other Use Cases&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c5 c12 c7"&gt;The architecture we presented is not limited to product manuals. The same pattern applies to any scenario where structured data needs to be extracted from unstructured documents:&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL class="c13 lst-kix_4dys9z69anad-0 start"&gt;
&lt;LI class="c14 li-bullet-0"&gt;&lt;SPAN class="c6"&gt;Manufacturing:&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;Extract component specifications from supplier data sheets for bill-of-materials automation&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI class="c14 li-bullet-0"&gt;&lt;SPAN class="c6"&gt;Healthcare:&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;Pull structured fields from clinical trial reports or medical device documentation&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI class="c14 li-bullet-0"&gt;&lt;SPAN class="c6"&gt;Financial Services:&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;Parse loan agreements, insurance policies, or regulatory filings&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI class="c14 li-bullet-0"&gt;&lt;SPAN class="c6"&gt;Retail:&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;Build product catalogs from vendor-provided PDFs for marketplace onboarding&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="c0"&gt;&lt;SPAN class="c5 c12 c7"&gt;For organizations looking to extend this solution, here are natural next steps:&lt;/SPAN&gt;&lt;/P&gt;
&lt;OL class="c13 lst-kix_ut0mbkt3ha8j-0 start" start="1"&gt;
&lt;LI class="c14 li-bullet-0"&gt;&lt;SPAN class="c6"&gt;Scale to thousands of documents&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;using file-arrival triggers on Lakeflow Jobs to process them automatically as they arrive.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI class="c14 li-bullet-0"&gt;&lt;SPAN class="c6"&gt;Connect external sources&lt;/SPAN&gt;&lt;SPAN class="c7"&gt;&amp;nbsp;with Lakeflow Connect to automatically ingest files from SharePoint or cloud storage into Unity Catalog Volumes (&lt;/SPAN&gt;&lt;SPAN class="c7 c34"&gt;&lt;A class="c8" href="https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/sharepoint-source-setup-overview" target="_blank" rel="noopener"&gt;see SharePoint connector&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;).&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI class="c14 li-bullet-0"&gt;&lt;SPAN class="c6"&gt;Build dashboards&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;with Databricks AI/BI for visual product comparison charts and specification heatmaps.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI class="c14 li-bullet-0"&gt;&lt;SPAN class="c6"&gt;Refine extraction continuously&lt;/SPAN&gt;&lt;SPAN class="c5 c12 c7"&gt;&amp;nbsp;using the MLflow evaluation feedback loop to track quality as document volumes grow.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
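&lt;P class="c0"&gt;&lt;SPAN class="c5 c12 c7"&gt;As a rough sketch of the first step, the snippet below builds a job settings payload with a file-arrival trigger. It makes no API call. The volume path, job name, task key, and pipeline ID are placeholders, and the field names reflect our understanding of the Databricks Jobs API, so verify them against the current API reference before use. It also assumes the extraction runs as a pipeline task, which may not match your setup.&lt;/SPAN&gt;&lt;/P&gt;

```python
import json

# Hypothetical Unity Catalog volume path; replace with your own location.
MANUALS_VOLUME = "/Volumes/main/product_docs/manuals/"


def build_job_settings(volume_url, pipeline_id):
    """Build a Jobs API style settings payload with a file-arrival trigger."""
    return {
        "name": "idp-manuals-on-arrival",  # illustrative job name
        "trigger": {
            "pause_status": "UNPAUSED",
            "file_arrival": {
                # Location to watch for newly arrived files
                "url": volume_url,
                # Assumed throttling settings, both in seconds
                "min_time_between_triggers_seconds": 60,
                "wait_after_last_change_seconds": 60,
            },
        },
        "tasks": [
            {
                "task_key": "run_extraction_pipeline",  # illustrative task key
                "pipeline_task": {"pipeline_id": pipeline_id},
            }
        ],
    }


settings = build_job_settings(MANUALS_VOLUME, "pipeline-id-placeholder")
print(json.dumps(settings, indent=2))
```

&lt;P class="c0"&gt;&lt;SPAN class="c5 c12 c7"&gt;The printed JSON could then be submitted through the Jobs API, the Databricks CLI, or an asset bundle, whichever your team already uses for deployment.&lt;/SPAN&gt;&lt;/P&gt;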
&lt;P class="c0"&gt;&lt;SPAN class="c5 c12 c7"&gt;Explore the &lt;A href="https://github.com/databricks-solutions/databricks-blogposts/tree/main/2026-04-Intelligent-document-processing" target="_self"&gt;accompanying Databricks Solution Accelerator on GitHub&lt;/A&gt;, or speak with your Databricks representative to start building your own AI-powered Intelligent Document Processing pipeline today.&lt;/SPAN&gt;&lt;/P&gt;
</description>
    <pubDate>Fri, 08 May 2026 13:21:30 GMT</pubDate>
    <dc:creator>NikkTheGreek</dc:creator>
    <dc:date>2026-05-08T13:21:30Z</dc:date>
    <item>
      <title>Intelligent Document Processing for Data Extraction: Transforming Product Manuals into Insights</title>
      <link>https://community.databricks.com/t5/technical-blog/intelligent-document-processing-for-data-extraction-transforming/ba-p/153847</link>
      <description>&lt;UL&gt;
&lt;LI aria-level="1"&gt;&lt;STRONG&gt;Turn unstructured product manuals into structured, queryable data&lt;/STRONG&gt;&lt;SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;using Databricks AI Functions, with no custom model training or rigid templates required.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-level="1"&gt;&lt;STRONG&gt;Build a complete document intelligence pipeline&lt;/STRONG&gt;&lt;SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;that parses PDFs, extracts structured fields, evaluates quality, and exposes results through natural-language interfaces, all on a single platform.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-level="1"&gt;&lt;STRONG&gt;Address real extraction challenges&lt;/STRONG&gt;&lt;SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;such as inconsistent terminology across vendors, varying document formats, and differing levels of detail, using prompt engineering and declarative schemas.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Fri, 08 May 2026 13:21:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/technical-blog/intelligent-document-processing-for-data-extraction-transforming/ba-p/153847</guid>
      <dc:creator>NikkTheGreek</dc:creator>
      <dc:date>2026-05-08T13:21:30Z</dc:date>
    </item>
  </channel>
</rss>

