2 weeks ago
Hi all,
I am exploring Databricks services or components that could be considered equivalent to Azure Document Intelligence and Azure Content Understanding.
Our customer works with dozens of Excel and PDF files. These files follow multiple template types, and the formats may evolve over time. For example, some files contain data in a standard tabular structure, others use pivot-style Excel layouts, and some follow more complex or semi-structured formats.
We already have a Databricks license. Instead of relying on Azure Content Understanding, we would like to understand whether Databricks can be used to automatically infer file structures and extract the required values.
As an example, if "England" appears on the row axis and "20251205" appears as a column header in a pivot table, we would like to normalize this into a record such as:
20251205, England, sales_amount = 500,000 GBP.
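For illustration, the normalization we have in mind is essentially an unpivot. A minimal pandas sketch (the DataFrame contents and column names here are made up for the example):

```python
import pandas as pd

# Hypothetical pivot-style layout: countries on the row axis,
# dates as column headers, sales amounts in the cells.
pivot = pd.DataFrame(
    {
        "country": ["England", "Scotland"],
        "20251205": [500_000, 120_000],
        "20251206": [480_000, 130_000],
    }
)

# Unpivot into one record per (date, country) pair.
records = pivot.melt(
    id_vars="country",
    var_name="date",
    value_name="sales_amount",
)
records["currency"] = "GBP"

print(records)
#    country      date  sales_amount currency
# 0  England  20251205        500000      GBP
# ...
```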
I have also attached sample Excel templates, which represent several of the formats we receive. If we extract text from these Excel files and invoke the Databricks ai_parse_document function, I am not confident that the contextual meaning will be preserved. For instance, Column B represents the laboratory method used for experiments; however, this information is not explicitly labeled or defined within the Excel structure itself.
In addition, the ai_parse_document function does not support multiple languages.
I have reviewed other Databricks capabilities such as ai_query, ai_extract, and AgentBricks, but I am still uncertain which solution or combination of technologies would be the most appropriate fit for this use case.
Could you please advise how this requirement could be implemented using Databricks services or components?
Best regards,
a week ago
Hi,
I think the functionality you need could be covered by a few different capabilities. The first is the ai_parse_document function: https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document
Here you'd need to specify the format of the output in a JSON file. If you want the JSON format to be inferred automatically, you could also try the information extraction agent here: https://docs.databricks.com/aws/en/generative-ai/agent-bricks/key-info-extraction
Note they both have different regional availability and the information extraction agent is in Beta.
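For example, following the pattern in the ai_parse_document docs, a minimal sketch would look like this (the volume path is just a placeholder for wherever your files live):

```python
# Run ai_parse_document over documents stored in a Unity Catalog volume.
# read_files with format => 'binaryFile' loads the raw file contents,
# which are then passed to the AI function.
parsed = spark.sql("""
    SELECT
      path,
      ai_parse_document(content) AS parsed
    FROM read_files(
      '/Volumes/main/default/raw_docs/',
      format => 'binaryFile'
    )
""")
display(parsed)
```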
a week ago
Thank you, Emma.
I reviewed the documentation but was unable to find this information. How can I determine the regional availability of these technologies? Is there an official Databricks reference or link that I can share with the relevant stakeholders?
a week ago
Hi, information extraction is part of Agent Bricks (https://docs.databricks.com/aws/en/generative-ai/agent-bricks/), which is only available in US regions at the moment. The ai_parse_document function is available in the EU on AWS, but not on Azure currently. This is the best link: https://docs.databricks.com/aws/en/resources/feature-region-support
Azure version: https://learn.microsoft.com/en-us/azure/databricks/resources/feature-region-support
For ai_parse_document, you are looking at the AI function batch inference column.
a week ago
Thank you, Emma.
The ai_parse_document function does not support Excel files. Is there a Databricks-recommended approach or best practice to overcome this limitation when processing Excel documents?
a week ago
Hi, I think this is probably the easiest approach. It's in Beta at the moment, so a workspace admin would need to turn it on: https://docs.databricks.com/aws/en/query/formats/excel
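As a rough sketch of what that looks like once enabled (the path is a placeholder, and the exact option names should be double-checked against that docs page):

```python
# Read an Excel file with the Beta Excel reader via read_files.
# The volume path below is a made-up example.
df = spark.sql("""
    SELECT *
    FROM read_files(
      '/Volumes/main/default/raw_excel/sales.xlsx',
      format => 'excel'
    )
""")
display(df)
```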
a week ago
We are working with a large number of Excel files in different formats (100+). While some files contain simple tabular structures with single-row headers, others use pivot-style layouts or more complex structures with embedded text and multi-row headers. These formats also evolve over time.
Would the Databricks Excel reader approach described here (https://docs.databricks.com/aws/en/query/formats/excel) be sufficient to handle this level of variability?
a week ago
It would work, but you will need to specify and manage the ranges or the number of header rows manually. You could potentially read the whole sheet in and then write some code that identifies the range of interest and cleans it during parsing. My recommendation would be to test it out on your actual data.
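Something along these lines, as a rough illustration of the "read everything, then locate the range" idea (the anchor label "Country" and the file name are made up; you'd adapt the detection logic per template family):

```python
import pandas as pd

# Read the sheet with no header so nothing is lost, then locate the
# row that contains a known anchor label and treat everything below
# it as the data block.
raw = pd.read_excel("sales_template.xlsx", header=None)

# Find the first row whose first cell matches the expected header label.
anchor_rows = raw.index[raw[0].astype(str).str.strip() == "Country"]
header_row = anchor_rows[0]

# Re-slice: use the anchor row as the header and the rows below as data.
data = raw.iloc[header_row + 1 :].copy()
data.columns = raw.iloc[header_row].astype(str).str.strip()
data = data.dropna(how="all").reset_index(drop=True)

print(data.head())
```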