Hi all,
I am exploring Databricks services or components that could be considered equivalent to Azure Document Intelligence and Azure Content Understanding.
Our customer works with dozens of Excel and PDF files. These files follow multiple template types, and the formats may evolve over time. For example, some files contain data in a standard tabular structure, others use pivot-style Excel layouts, and some follow more complex or semi-structured formats.
We already have a Databricks license. Instead of relying on Azure Content Understanding, we would like to understand whether Databricks can be used to automatically infer file structures and extract the required values.
As an example, if “England” appears on the row axis and “20251205” appears as a column header in a pivot table, we would like to normalize this into a record such as:
20251205, England, sales_amount = 500,000 GBP.
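To make the target shape concrete, here is a minimal plain-Python sketch of the normalization I have in mind. The grid contents, the `unpivot` helper, and the field names `date`, `region`, and `sales_amount` are all made up for illustration and are not taken from the attached files. It also assumes the layout is already known (row labels in the first column, dates in the first row), which is exactly the part we would like Databricks to infer.

```python
from typing import Any


def unpivot(grid: list[list[Any]], value_name: str = "sales_amount") -> list[dict[str, Any]]:
    """Turn a pivot-style grid into one record per (column header, row label) cell."""
    header, *rows = grid
    col_keys = header[1:]  # the first header cell is the corner label, e.g. "region"
    records = []
    for row in rows:
        row_label, *values = row
        for col_key, value in zip(col_keys, values):
            if value is None:  # skip empty cells
                continue
            records.append({"date": col_key, "region": row_label, value_name: value})
    return records


# Hypothetical sheet contents: regions on the row axis, dates as column headers.
grid = [
    ["region", "20251205", "20251206"],
    ["England", 500000, 510000],
    ["Scotland", None, 125000],
]

records = unpivot(grid)
for record in records:
    print(record)
```

The hard part is not this reshaping step but detecting which cells are headers, row labels, and values when the template changes.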
I have also attached sample Excel templates, which represent several of the formats we receive. If we extract text from these Excel files and invoke the Databricks ai_parse_document function, I am not confident that the contextual meaning will be preserved. For instance, Column B represents the laboratory method used for experiments; however, this information is not explicitly labeled or defined within the Excel structure itself.
In addition, the ai_parse_document function does not support multiple languages.
I have reviewed other Databricks capabilities such as ai_query, ai_extract, and Agent Bricks, but I am still uncertain which solution or combination of technologies would be the most appropriate fit for this use case.
Could you please advise how this requirement could be implemented using Databricks services or components?
Best regards,