@js54123875 wrote:
Azure AI Document Intelligence | Microsoft Azure
Does anyone have experience ingesting outputs from Azure Document Intelligence and/or know of some guides on how best to ingest this data? Specifically, we are looking to ingest tax form data that has been processed by Document Intelligence, but we are open to any patterns/examples.
Information that would be helpful:
- Example code sets
- How the data was modeled after ingestion
- How to use the model id to determine if a schema has changed and how to handle that in the ingestion pipeline
- etc.
Thanks!
Hi there!
Ingesting outputs from Azure Document Intelligence, especially for tax form data, can be streamlined with the right approach. Here are some resources and tips to help you get started:
Example Code Sets
Azure Document Intelligence provides SDKs in various languages, including C#, Python, Java, and JavaScript. Here's a basic example in Python to extract data from a tax form:
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
key = "YOUR_FORM_RECOGNIZER_KEY"

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

# Submit the document for analysis and wait for the long-running operation.
with open("path/to/your/taxform.pdf", "rb") as f:
    poller = document_analysis_client.begin_analyze_document(
        "prebuilt-tax.us.1040", document=f
    )
result = poller.result()

# Each analyzed document exposes its extracted fields by name.
for document in result.documents:
    for name, field in document.fields.items():
        print(f"{name}: {field.value}")
Data Modeling After Ingestion
Once the data is extracted, you can model it in a structured format such as JSON or a relational database. For example, you might create tables for different tax forms (e.g., W-2, 1099) with columns representing the extracted fields.
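As a minimal sketch of that modeling step, the extracted fields can be flattened into a typed record that maps directly onto a relational table row. The field names (`TaxpayerName`, `TotalIncome`) and the `{name: (value, confidence)}` input shape below are illustrative assumptions, not the exact keys the prebuilt model returns:

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Illustrative row for a 1040 return; column names are assumptions,
# not the exact field keys produced by the prebuilt model.
@dataclass
class TaxForm1040Record:
    document_id: str
    model_id: str
    taxpayer_name: Optional[str]
    total_income: Optional[float]
    confidence_min: float  # lowest field confidence, useful for review queues

def to_record(document_id, model_id, fields):
    """Flatten a {name: (value, confidence)} mapping into a storable row."""
    def val(name):
        entry = fields.get(name)
        return entry[0] if entry else None

    confidences = [conf for _, conf in fields.values()]
    return TaxForm1040Record(
        document_id=document_id,
        model_id=model_id,
        taxpayer_name=val("TaxpayerName"),
        total_income=val("TotalIncome"),
        confidence_min=min(confidences) if confidences else 0.0,
    )

row = to_record(
    "doc-001",
    "prebuilt-tax.us.1040",
    {"TaxpayerName": ("Jane Doe", 0.98), "TotalIncome": (52000.0, 0.91)},
)
print(asdict(row))
```

Keeping the model ID and a minimum-confidence column on every row makes it easy later to query for documents that need manual review or reprocessing.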
Handling Schema Changes
To handle schema changes, you can use the model ID to check for updates. Azure Document Intelligence versions its models, so you can record the model ID (and API version) used for each run and compare it against the previous run to detect when the output schema may have changed.
Here's a conceptual approach:
- Store the model ID: save the model ID used for each processed document.
- Check for updates: periodically check whether the model ID (or the set of fields it returns) has changed.
- Update the schema: if a change is detected, update your ingestion pipeline to accommodate the new schema.
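The steps above can be sketched as a simple schema fingerprint check. Since a prebuilt model's ID alone may not reveal a field-level change, one assumption-laden approach is to hash the set of field names returned on each run and compare it to the last known fingerprint for that model ID (the `known_schemas` store here stands in for whatever database your pipeline uses):

```python
import hashlib
import json

def schema_fingerprint(field_names):
    """Hash the sorted field names so any added/removed/renamed field changes the digest."""
    canonical = json.dumps(sorted(field_names))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Hypothetical store mapping model_id -> last known fingerprint;
# in a real pipeline this would live in a database or metadata table.
known_schemas = {}

def detect_schema_change(model_id, field_names):
    """Return True if this model's field set differs from the last recorded run."""
    fp = schema_fingerprint(field_names)
    previous = known_schemas.get(model_id)
    known_schemas[model_id] = fp
    return previous is not None and previous != fp

# First run records the baseline; a later run with an extra field flags a change.
detect_schema_change("prebuilt-tax.us.1040", ["TaxpayerName", "TotalIncome"])
changed = detect_schema_change(
    "prebuilt-tax.us.1040", ["TaxpayerName", "TotalIncome", "Dependents"]
)
print(changed)  # True
```

When a change is flagged, the pipeline can route affected documents to a quarantine table for review rather than failing the whole batch.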
Hope this helps.
Best regards,
florence023