Can I Replicate Azure Document Intelligence's Custom Table Extraction in Databricks?

AlbertWang
Valued Contributor

I am using Azure Document Intelligence to get data from a table in a PDF file. The table's headers do not visually align with the values. Therefore, the standard and pre-built models cannot correctly read the data.

I have built a custom-trained Azure Document Intelligence model and can read the data perfectly. When I trained the model, I used the Azure Document Intelligence feature and first ran a layout scan of the PDF file. Then, I created a new table type field and manually labelled and aligned each value detected on the PDF to one cell in the table field. After adding 4 PDF files, I could train a reasonably good model.

I want to know whether I can do the same or a similar thing on Databricks, using only Databricks features and without Azure Document Intelligence.

dkushari
Databricks Employee

Hi @AlbertWang, you can easily achieve this using Agent Bricks: Information Extraction. Your PDFs are first converted to text using the ai_parse_document function and saved in a Databricks table. You can then create the agent on that text table to get the output in JSON format.
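As a rough illustration of the first step (parsing the PDFs and saving the result to a table), here is a minimal sketch. The volume path, table name, and the helper build_parse_query are hypothetical placeholders, not anything from your workspace. It assumes the PDFs sit in a Unity Catalog volume and that ai_parse_document is available in your workspace. The helper only builds the SQL string, so it can be inspected before running it.

```python
def build_parse_query(volume_path: str, target_table: str) -> str:
    """Build SQL that parses every file under volume_path with
    ai_parse_document and stores the result in target_table.

    Both arguments are placeholders for illustration; substitute your own
    volume path and three-level table name.
    """
    return (
        f"CREATE OR REPLACE TABLE {target_table} AS\n"
        f"SELECT path, ai_parse_document(content) AS parsed\n"
        f"FROM READ_FILES('{volume_path}', format => 'binaryFile')"
    )


# Hypothetical locations, for illustration only.
query = build_parse_query(
    "/Volumes/main/default/pdfs/",
    "main.default.parsed_pdfs",
)
print(query)

# In a Databricks notebook, where `spark` is predefined, you would then run:
# spark.sql(query)
```

The resulting table (one row per PDF, with the parsed output in the parsed column) is what you would then point the Information Extraction agent at. Whether the agent resolves the misaligned headers as well as your custom-trained model does is something to check against your 4 labelled PDFs.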