@js54123875 wrote:
Azure AI Document Intelligence | Microsoft Azure
Does anyone have experience ingesting outputs from Azure Document Intelligence and/or know of some guides on how best to ingest this data? Specifically, we are looking to ingest tax form data that has been processed by Document Intelligence, but we are open to any patterns/examples.
Information that would be helpful:
- Example code sets
- How the data was modeled after ingestion
- How to use the model id to determine if a schema has changed and how to handle that in the ingestion pipeline
- etc.
Thanks!
Hi there!
Ingesting outputs from Azure Document Intelligence, especially for tax form data, can be streamlined with the right approach. Here are some resources and tips to help you get started:
Example Code Sets
Azure Document Intelligence provides SDKs in various languages, including C#, Python, Java, and JavaScript. Here's a basic example in Python to extract data from a tax form:
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
key = "YOUR_FORM_RECOGNIZER_KEY"

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

# Submit the document for analysis and wait for the long-running operation.
with open("path/to/your/taxform.pdf", "rb") as f:
    poller = document_analysis_client.begin_analyze_document(
        "prebuilt-tax.us.1040", document=f
    )
result = poller.result()

# Each analyzed document exposes its extracted fields by name.
for document in result.documents:
    for name, field in document.fields.items():
        print(f"{name}: {field.value}")
Data Modeling After Ingestion
Once the data is extracted, you can model it in a structured format such as JSON or a relational database. For example, you might create tables for different tax forms (e.g., W-2, 1099) with columns representing the extracted fields.
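As a minimal sketch of that modeling step, the extracted fields can be flattened into a typed record that maps directly onto a relational table row. The field names (`TaxpayerName`, `TotalIncome`) and the `{name: (value, confidence)}` input shape below are illustrative assumptions, not the exact keys the prebuilt model returns:

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Illustrative row for a 1040 return; column names are assumptions,
# not the exact field keys produced by the prebuilt model.
@dataclass
class TaxForm1040Record:
    document_id: str
    model_id: str
    taxpayer_name: Optional[str]
    total_income: Optional[float]
    confidence_min: float  # lowest field confidence, useful for review queues

def to_record(document_id, model_id, fields):
    """Flatten a {name: (value, confidence)} mapping into a storable row."""
    def val(name):
        entry = fields.get(name)
        return entry[0] if entry else None

    confidences = [conf for _, conf in fields.values()]
    return TaxForm1040Record(
        document_id=document_id,
        model_id=model_id,
        taxpayer_name=val("TaxpayerName"),
        total_income=val("TotalIncome"),
        confidence_min=min(confidences) if confidences else 0.0,
    )

row = to_record(
    "doc-001",
    "prebuilt-tax.us.1040",
    {"TaxpayerName": ("Jane Doe", 0.98), "TotalIncome": (52000.0, 0.91)},
)
print(asdict(row))
```

Keeping the model ID and a minimum-confidence column on every row makes it easy later to query for documents that need manual review or reprocessing.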
Handling Schema Changes
To handle schema changes, you can use the model ID to check for updates. Azure Document Intelligence versions its models, so you can record the model ID (and API version) used for each run and compare it against the previous run to detect when the output schema may have changed.
Here's a conceptual approach:
- Store the model ID: save the model ID used for each processed document.
- Check for updates: periodically check whether the model ID (or the set of fields it returns) has changed.
- Update the schema: if a change is detected, update your ingestion pipeline to accommodate the new schema.
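The steps above can be sketched as a simple schema fingerprint check. Since a prebuilt model's ID alone may not reveal a field-level change, one assumption-laden approach is to hash the set of field names returned on each run and compare it to the last known fingerprint for that model ID (the `known_schemas` store here stands in for whatever database your pipeline uses):

```python
import hashlib
import json

def schema_fingerprint(field_names):
    """Hash the sorted field names so any added/removed/renamed field changes the digest."""
    canonical = json.dumps(sorted(field_names))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Hypothetical store mapping model_id -> last known fingerprint;
# in a real pipeline this would live in a database or metadata table.
known_schemas = {}

def detect_schema_change(model_id, field_names):
    """Return True if this model's field set differs from the last recorded run."""
    fp = schema_fingerprint(field_names)
    previous = known_schemas.get(model_id)
    known_schemas[model_id] = fp
    return previous is not None and previous != fp

# First run records the baseline; a later run with an extra field flags a change.
detect_schema_change("prebuilt-tax.us.1040", ["TaxpayerName", "TotalIncome"])
changed = detect_schema_change(
    "prebuilt-tax.us.1040", ["TaxpayerName", "TotalIncome", "Dependents"]
)
print(changed)  # True
```

When a change is flagged, the pipeline can route affected documents to a quarantine table for review rather than failing the whole batch.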
Hope this helps.
Best regards,
florence023