Generative AI
Explore discussions on generative artificial intelligence techniques and applications within the Databricks Community. Share ideas, challenges, and breakthroughs in this cutting-edge field.

ai_parse_document + Genie with ai_query

GiriSreerangam
New Contributor III

Hi Everyone

I have used ai_parse_document to process multiple PDFs and store the parsed data in a table (one PDF per row). Later, I ran ai_query in natural language, which correctly scans all rows and returns answers from each PDF.

However, when I use this parsed table in a Genie space, the responses only include a few matches. The underlying query (via ai_query) does return accurate results across all rows, but Genie doesn't seem to provide the complete result set, even after I tried giving it instructions.

Has anyone else experienced this behavior? Any suggestions would be greatly appreciated.

Regards,
Giri

5 REPLIES 5

stbjelcevic
Databricks Employee

Hi @GiriSreerangam ,

How many rows are in the results table, and how large are these PDF documents? Genie's UI and API may truncate large result sets, so it's possible you are seeing only a subset of matches even when the underlying SQL scans all rows.

As a quick follow-up, can you inspect the generated SQL in the response to see if Genie added any implicit LIMIT, filter, or top-k clause that reduces matches? If you find one, you can re-ask with explicit instructions or run the statement in the SQL editor to verify full results.
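
A rough way to run that check outside Genie is sketched below in Python. The helper name `find_row_caps`, the table `docs`, and the endpoint name `my-endpoint` are illustrative assumptions, not part of Genie or any Databricks API. The function only scans a SQL string for `LIMIT`, `TABLESAMPLE`, or `QUALIFY` after dropping comments and single-quoted literals, so it is a heuristic, not a SQL parser.

```python
import re

# Clauses that can cap or narrow the rows a statement returns.
_ROW_CAP_PATTERNS = {
    "LIMIT": re.compile(r"\bLIMIT\s+\d+", re.IGNORECASE),
    "TABLESAMPLE": re.compile(r"\bTABLESAMPLE\b", re.IGNORECASE),
    "QUALIFY": re.compile(r"\bQUALIFY\b", re.IGNORECASE),
}

# Single-quoted literals, line comments, and block comments, matched in one
# pass so that a "--" inside a string is not mistaken for a comment.
_NOISE = re.compile(r"'(?:[^']|'')*'|--[^\n]*|/\*.*?\*/", re.DOTALL)


def find_row_caps(sql: str) -> list[str]:
    """Return the names of row-capping clauses found in the SQL text."""
    cleaned = _NOISE.sub(" ", sql)
    return [name for name, pattern in _ROW_CAP_PATTERNS.items()
            if pattern.search(cleaned)]


# Hypothetical statement copied from a Genie response.
generated_sql = (
    "SELECT path, ai_query('my-endpoint', CONCAT('Limit 5 words: ', parsed)) "
    "AS answer FROM docs -- LIMIT 3\n"
    "LIMIT 5"
)
print(find_row_caps(generated_sql))  # ['LIMIT']
```

An empty list only means none of these three clauses appears; a restrictive `WHERE` filter would still need to be read by eye.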

Raman_Unifeye
Contributor III

@GiriSreerangam - I faced a similar issue when running my Genie space against multiple parsed PDFs. However, when I later split the result set (logically, as I had to extract the relevant text), the Genie results were accurate. I suspect it is due to the implicit query truncation mentioned by @stbjelcevic.

Please share your findings once you have applied the suggested solution.


RG #Driving Business Outcomes with Data Intelligence

GiriSreerangam
New Contributor III

Thank you, @stbjelcevic and @Raman_Unifeye

I reviewed the generated SQL and did not observe any limits or truncation. A few additional details: Genie is returning all 14 rows (corresponding to my 14 PDFs), and they are displayed correctly in table format. Any of the rows may contain a valid answer to our question; rows that do not can be ignored.
However, when Genie summarizes the results, it does not appear to consider all 14 rows. My observation is that it scans only about 5 rows and bases the summary on those. When I explicitly ask it to include the other rows, the responses are sometimes accurate and sometimes not. I am also adding these findings to the instructions tab for reference.

Hubert-Dudek
Esteemed Contributor III

I think Genie is not optimized for this use case. Please run some experiments: chunk the PDFs and use the multi-agent supervisor from Agent Bricks to combine Genie with a Knowledge Base (although I haven't tried it yet).


My blog: https://databrickster.medium.com/

Thank you, @Hubert-Dudek. The other approach, using Agent Bricks to build agents on top of structured (via Genie) and unstructured data, is already done and is working as expected. I am looking for a solution using Genie as well. We are currently working on chunking and embedding the documents and then using them in Genie.
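
For the chunking step, a minimal Python sketch might look like the following. Fixed-size character windows with overlap are an assumption, as are the names `chunk_text` and `to_chunk_rows` and the `doc_id`, `chunk_index`, and `text` fields; the thread does not say which chunking strategy or table layout is in use, and the embedding step is left out.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of at most max_chars that overlap by `overlap` characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in the range [0, max_chars)")

    chunks = []
    step = max_chars - overlap  # how far the window advances each time
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


def to_chunk_rows(doc_id: str, text: str, **kwargs) -> list[dict]:
    """Turn one parsed document into one row per chunk (hypothetical layout)."""
    return [
        {"doc_id": doc_id, "chunk_index": i, "text": chunk}
        for i, chunk in enumerate(chunk_text(text, **kwargs))
    ]


rows = to_chunk_rows("report_01.pdf", "abcdefghij", max_chars=4, overlap=1)
print([row["text"] for row in rows])  # ['abcd', 'defg', 'ghij']
```

One row per chunk keeps each row small, which is in line with the earlier observation in this thread that splitting the result set gave more accurate Genie answers.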