<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic OCRmyPDF in Databricks in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/ocrmypdf-in-databricks/m-p/95134#M9356</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;Do any of you have experience using OCRmyPDF in Databricks? I have tried installing it in various ways and with different versions, but my notebook keeps crashing with the error:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;The Python process exited with exit code 139 (SIGSEGV: Segmentation fault).&lt;/LI-CODE&gt;&lt;P&gt;I have made a minimal code example to reproduce it:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import ocrmypdf
import io

with open('noocr.pdf', 'rb') as f:
    pdf = f.read()

pdf_bytes = io.BytesIO(pdf)
out_bytes = io.BytesIO()

ocrmypdf.ocr(
    pdf_bytes,
    out_bytes,
    language='dan',
    output_type="pdf",
    optimize=0,
    fast_web_view=9999999,
    force_ocr=True
    )&lt;/LI-CODE&gt;&lt;P&gt;I have tried to narrow down the issue, but the point at which it crashes varies. Often it happens when Tesseract is called as a subprocess, but also when the PDF is being saved to a temporary folder.&lt;/P&gt;&lt;P&gt;When I run the code through the debugger, it seems to complete without errors.&lt;/P&gt;&lt;P&gt;Any suggestions would be greatly appreciated!&lt;/P&gt;</description>
    <pubDate>Mon, 21 Oct 2024 08:34:13 GMT</pubDate>
    <dc:creator>kro</dc:creator>
    <dc:date>2024-10-21T08:34:13Z</dc:date>
    <item>
      <title>OCRmyPDF in Databricks</title>
      <link>https://community.databricks.com/t5/get-started-discussions/ocrmypdf-in-databricks/m-p/95134#M9356</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;Do any of you have experience using OCRmyPDF in Databricks? I have tried installing it in various ways and with different versions, but my notebook keeps crashing with the error:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;The Python process exited with exit code 139 (SIGSEGV: Segmentation fault).&lt;/LI-CODE&gt;&lt;P&gt;I have made a minimal code example to reproduce it:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import ocrmypdf
import io

with open('noocr.pdf', 'rb') as f:
    pdf = f.read()

pdf_bytes = io.BytesIO(pdf)
out_bytes = io.BytesIO()

ocrmypdf.ocr(
    pdf_bytes,
    out_bytes,
    language='dan',
    output_type="pdf",
    optimize=0,
    fast_web_view=9999999,
    force_ocr=True
    )&lt;/LI-CODE&gt;&lt;P&gt;I have tried to narrow down the issue, but the point at which it crashes varies. Often it happens when Tesseract is called as a subprocess, but also when the PDF is being saved to a temporary folder.&lt;/P&gt;&lt;P&gt;When I run the code through the debugger, it seems to complete without errors.&lt;/P&gt;&lt;P&gt;Any suggestions would be greatly appreciated!&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2024 08:34:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/ocrmypdf-in-databricks/m-p/95134#M9356</guid>
      <dc:creator>kro</dc:creator>
      <dc:date>2024-10-21T08:34:13Z</dc:date>
    </item>
    <item>
      <title>Re: OCRmyPDF in Databricks</title>
      <link>https://community.databricks.com/t5/get-started-discussions/ocrmypdf-in-databricks/m-p/115537#M9357</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/128276"&gt;@kro&lt;/a&gt;&amp;nbsp;You can try reading the PDF directly into a Spark DataFrame using &lt;A href="https://stabrise.com/spark-pdf/" target="_self"&gt;PDF DataSource&lt;/A&gt;. It extracts text from both digital and scanned PDFs.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.read.format("pdf") \
.option("imageType", "BINARY") \
.option("resolution", "200") \
.option("pagePerPartition", "2") \
.option("reader", "pdfBox") \
.load("path to the pdf file(s)")&lt;/LI-CODE&gt;&lt;P&gt;The resulting DataFrame has the following schema:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;root
 |-- path: string (nullable = true)
 |-- filename: string (nullable = true)
 |-- page_number: integer (nullable = true)
 |-- partition_number: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- image: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- resolution: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |    |-- imageType: string (nullable = true)
 |    |-- exception: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |-- document: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- text: string (nullable = true)
 |    |-- outputType: string (nullable = true)
 |    |-- bBoxes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |    |    |-- score: float (nullable = true)
 |    |    |    |-- x: integer (nullable = true)
 |    |    |    |-- y: integer (nullable = true)
 |    |    |    |-- width: integer (nullable = true)
 |    |    |    |-- height: integer (nullable = true)
 |    |-- exception: string (nullable = true)&lt;/LI-CODE&gt;</description>
      <pubDate>Tue, 15 Apr 2025 14:34:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/ocrmypdf-in-databricks/m-p/115537#M9357</guid>
      <dc:creator>Mykola_Melnyk</dc:creator>
      <dc:date>2025-04-15T14:34:11Z</dc:date>
    </item>
    <item>
      <title>Re: OCRmyPDF in Databricks</title>
      <link>https://community.databricks.com/t5/get-started-discussions/ocrmypdf-in-databricks/m-p/115570#M9358</link>
      <description>&lt;P&gt;You can also refer to this thread: &lt;A href="https://community.databricks.com/t5/data-engineering/pdf-parsing-in-notebook/td-p/14636" target="_blank"&gt;https://community.databricks.com/t5/data-engineering/pdf-parsing-in-notebook/td-p/14636&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 15 Apr 2025 16:44:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/ocrmypdf-in-databricks/m-p/115570#M9358</guid>
      <dc:creator>sridharplv</dc:creator>
      <dc:date>2025-04-15T16:44:51Z</dc:date>
    </item>
  </channel>
</rss>

