<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: ai_parse_document struggling to detect pdf in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/ai-parse-document-struggling-to-detect-pdf/m-p/127316#M1077</link>
    <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/78221"&gt;@JN_Bristol&lt;/a&gt;&amp;nbsp;Can you describe the PDF document (size, contents) or share it? I have mixed experience with ai_parse&lt;/P&gt;</description>
    <pubDate>Mon, 04 Aug 2025 11:03:37 GMT</pubDate>
    <dc:creator>Sharanya13</dc:creator>
    <dc:date>2025-08-04T11:03:37Z</dc:date>
    <item>
      <title>ai_parse_document struggling to detect pdf</title>
      <link>https://community.databricks.com/t5/generative-ai/ai-parse-document-struggling-to-detect-pdf/m-p/127244#M1072</link>
      <description>&lt;P&gt;Hi helpful experts &lt;span class="lia-unicode-emoji" title=":glowing_star:"&gt;🌟&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I'm writing my first PySpark Notebook that makes use of the new `ai_parse_document` function.&amp;nbsp; I am basically following the code example from here:&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document" target="_blank"&gt;https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document&lt;/A&gt;&lt;/P&gt;&lt;P&gt;(and doing it on Azure Databricks, if that helps)&lt;/P&gt;&lt;P&gt;My code:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import ai_parse_document

volume_path = '/Volumes/gen_ai/bank_statements/raw_pdfs'

raw_pdfs = (
    spark.read
    .format('binaryFile')
    .load(f'{volume_path}/*.pdf')
)

# this line works fine... I can see 'length' = 332159 and 'content' is binary
raw_pdfs.display()

# this line runs ok... but the output is in the 'corrupted_data' property
parsed_pdfs = (
    raw_pdfs
    .withColumn(
        'content_parsed',
        ai_parse_document('content')
    )
)&lt;/LI-CODE&gt;&lt;P&gt;The error message is:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;error_message: "[UNSTRUCTURED_DATA_PROCESSING_UNSUPPORTED_FILE_FORMAT] Unstructured file format detected: unknown is not supported. Supported file formats are auto, pdf, doc, docx, ppt, pptx, png, jpg, jpeg.\nPlease update the `format` from your ai function expression to one of the supported formats and then retry the query again. SQLSTATE: 0A000"&lt;/LI-CODE&gt;&lt;P&gt;And yet the file _is_ a PDF.&amp;nbsp; I downloaded it from my bank, and can open it fine in Acrobat and other tools.&amp;nbsp; So I don't think the file itself is corrupted? &lt;span class="lia-unicode-emoji" title=":thinking_face:"&gt;🤔&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Does anyone know what the error message means by "update the format from your ai function expression"?&amp;nbsp; I can't see a parameter for that in the ai_parse_document documentation.&lt;/P&gt;&lt;P&gt;Alternatively, are there some PDFs that this (beta) function just can't handle yet?&lt;/P&gt;&lt;P&gt;Any advice much appreciated &lt;span class="lia-unicode-emoji" title=":folded_hands:"&gt;🙏🏻&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 03 Aug 2025 11:43:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/ai-parse-document-struggling-to-detect-pdf/m-p/127244#M1072</guid>
      <dc:creator>JN_Bristol</dc:creator>
      <dc:date>2025-08-03T11:43:21Z</dc:date>
    </item>
    <item>
      <title>Re: ai_parse_document struggling to detect pdf</title>
      <link>https://community.databricks.com/t5/generative-ai/ai-parse-document-struggling-to-detect-pdf/m-p/127266#M1074</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/78221"&gt;@JN_Bristol&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;There are some limitations when using the&lt;SPAN&gt;&amp;nbsp;&lt;STRONG&gt;`ai_parse_document`&lt;/STRONG&gt; function:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;1.)&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;While Databricks is continuously working to improve all of its features, LLMs are an emerging technology and may produce errors.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;2.)&lt;STRONG&gt; The&amp;nbsp;&lt;CODE&gt;ai_parse_document&lt;/CODE&gt;&amp;nbsp;function can take time to extract document content while preserving structural information, especially for documents that contain highly dense content or content with poor resolution. In some cases, the function may take a while to run or ignore content. Databricks is continuously working to improve latency.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;I am sharing official documentation for your reference:&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document#limitations" target="_self"&gt;https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document#limitations&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Suggestion:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Your input data files must be stored as &lt;STRONG&gt;blob data in bytes&lt;/STRONG&gt;, &lt;STRONG&gt;meaning a binary type column in a DataFrame or Delta table.&lt;/STRONG&gt;&amp;nbsp;As your source documents are stored in a Unity Catalog volume, can you generate the binary type column using the Spark&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;binaryFile&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;format reader?&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;I am sharing official documentation for your reference:&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document#-supported-input-file-formats" target="_self"&gt;https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document#-supported-input-file-formats&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;
</description>
      <pubDate>Mon, 04 Aug 2025 04:53:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/ai-parse-document-struggling-to-detect-pdf/m-p/127266#M1074</guid>
      <dc:creator>Vinay_M_R</dc:creator>
      <dc:date>2025-08-04T04:53:52Z</dc:date>
    </item>
    <item>
      <title>Re: ai_parse_document struggling to detect pdf</title>
      <link>https://community.databricks.com/t5/generative-ai/ai-parse-document-struggling-to-detect-pdf/m-p/127316#M1077</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/78221"&gt;@JN_Bristol&lt;/a&gt;&amp;nbsp;Can you describe the PDF document (size, contents) or share it? I have mixed experience with ai_parse&lt;/P&gt;</description>
      <pubDate>Mon, 04 Aug 2025 11:03:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/ai-parse-document-struggling-to-detect-pdf/m-p/127316#M1077</guid>
      <dc:creator>Sharanya13</dc:creator>
      <dc:date>2025-08-04T11:03:37Z</dc:date>
    </item>
    <item>
      <title>Re: ai_parse_document struggling to detect pdf</title>
      <link>https://community.databricks.com/t5/generative-ai/ai-parse-document-struggling-to-detect-pdf/m-p/127376#M1080</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/139873"&gt;@Sharanya13&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It's an &lt;EM&gt;actual&lt;/EM&gt; bank statement (not dummy data)... so, alas no, I cannot share it &lt;span class="lia-unicode-emoji" title=":neutral_face:"&gt;😐&lt;/span&gt;&amp;nbsp; It's 6 pages, and contains a mixture of tables, graphics, and summary small print.&lt;/P&gt;&lt;P&gt;Are you suggesting I try "ai_parse" instead of "ai_parse_document"?&amp;nbsp; ok, I'll give that a go &lt;span class="lia-unicode-emoji" title=":folded_hands:"&gt;🙏🏻&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Thanks &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 04 Aug 2025 20:19:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/ai-parse-document-struggling-to-detect-pdf/m-p/127376#M1080</guid>
      <dc:creator>JN_Bristol</dc:creator>
      <dc:date>2025-08-04T20:19:03Z</dc:date>
    </item>
    <item>
      <title>Re: ai_parse_document struggling to detect pdf</title>
      <link>https://community.databricks.com/t5/generative-ai/ai-parse-document-struggling-to-detect-pdf/m-p/127377#M1081</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/76894"&gt;@Vinay_M_R&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for replying.&amp;nbsp; The docs link is the same as the link that I included in my original post - and it is where I am following the code examples from.&amp;nbsp; That example shows a pdf being read from a Volume - but are you saying I should not do this and should read directly from a Blob store instead? &lt;span class="lia-unicode-emoji" title=":thinking_face:"&gt;🤔&lt;/span&gt;&amp;nbsp; I thought the Databricks position was that Volumes are the way forward?&lt;/P&gt;</description>
      <pubDate>Mon, 04 Aug 2025 20:21:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/ai-parse-document-struggling-to-detect-pdf/m-p/127377#M1081</guid>
      <dc:creator>JN_Bristol</dc:creator>
      <dc:date>2025-08-04T20:21:21Z</dc:date>
    </item>
    <item>
      <title>Re: ai_parse_document struggling to detect pdf</title>
      <link>https://community.databricks.com/t5/generative-ai/ai-parse-document-struggling-to-detect-pdf/m-p/141165#M1501</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Hello&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://community.databricks.com/t5/user/viewprofilepage/user-id/78221" target="_blank" rel="noopener"&gt;@JN_Bristol&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;I discovered that ai_parse_document only works when the input is passed as &lt;STRONG&gt;real Python bytes&lt;/STRONG&gt;.&lt;BR /&gt;The binaryFile format in Spark returns the content as an internal binary type (like a memoryview), and ai_parse_document can’t process that directly.&lt;BR /&gt;By using a UDF to convert the data into actual bytes, the function starts working correctly.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import ai_parse_document
import pyspark.sql.functions as F
from pyspark.sql.types import BinaryType
import base64

def conversor(content):
    # Return plain bytes for ai_parse_document.
    # Note: b64decode expects `content` to be base64 text.
    return base64.b64decode(content)
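
# Optional diagnostic, not part of the original reply: a minimal sketch of a
# hypothetical helper (the name sniff_pdf and its labels are assumptions) for
# checking what the binaryFile `content` column really holds before picking a
# conversion. It relies on two facts: a PDF file normally starts with the
# ASCII marker %PDF-, and base64 text of such a file starts with JVBERi0.
def sniff_pdf(content):
    if content is None:
        return 'unknown'
    # Look only at the first bytes; bytes() also accepts bytearray/memoryview.
    head = bytes(content[:64]).lstrip()
    if head.startswith(b'%PDF-'):
        return 'raw-pdf'      # already PDF bytes
    if head.startswith(b'JVBERi0'):
        return 'base64-pdf'   # base64 text, which conversor() decodes
    return 'unknown'

# Possible usage once raw_pdfs is loaded (kept as a comment, it needs Spark):
# print(sniff_pdf(raw_pdfs.first()['content']))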


conversor_udf = F.udf(conversor, BinaryType())


volume_path = '/Volumes/catalog/schema/volume'

raw_pdfs = (
    spark.read
    .format('binaryFile')
    .load(f'{volume_path}/*.pdf')
).limit(1)


display(raw_pdfs)

parsed_pdfs = (
    raw_pdfs.withColumn(
        'content_bin',conversor_udf('content')
    )
    .withColumn(
        'content_parsed',
        ai_parse_document('content_bin')
    )
)
display(parsed_pdfs)&lt;/LI-CODE&gt;</description>
      <pubDate>Thu, 04 Dec 2025 14:00:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/ai-parse-document-struggling-to-detect-pdf/m-p/141165#M1501</guid>
      <dc:creator>lucaperes</dc:creator>
      <dc:date>2025-12-04T14:00:45Z</dc:date>
    </item>
    <item>
      <title>Re: ai_parse_document struggling to detect pdf</title>
      <link>https://community.databricks.com/t5/generative-ai/ai-parse-document-struggling-to-detect-pdf/m-p/141343#M1507</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/50639"&gt;@luca&lt;/a&gt;&amp;nbsp;wow!!&amp;nbsp; Thanks for this - that's exactly the code snippet I needed &lt;span class="lia-unicode-emoji" title=":smiling_face_with_smiling_eyes:"&gt;😊&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Kudos very well earned &lt;span class="lia-unicode-emoji" title=":folded_hands:"&gt;🙏🏻&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 07 Dec 2025 14:50:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/ai-parse-document-struggling-to-detect-pdf/m-p/141343#M1507</guid>
      <dc:creator>JN_Bristol</dc:creator>
      <dc:date>2025-12-07T14:50:17Z</dc:date>
    </item>
  </channel>
</rss>

