OCRmyPDF in Databricks

kro — Mon, 21 Oct 2024 08:34:13 GMT

Hello,

Do any of you have experience with using OCRmyPDF in Databricks? I have tried to install it in various ways and with different versions, but my notebook keeps crashing with the error:

The Python process exited with exit code 139 (SIGSEGV: Segmentation fault).

I have made a minimal code example to test it:

import ocrmypdf
import io

with open('noocr.pdf', 'rb') as f:
    pdf = f.read()

pdf_bytes = io.BytesIO(pdf)
out_bytes = io.BytesIO()

ocrmypdf.ocr(
    pdf_bytes,
    out_bytes,
    language='dan',
    output_type="pdf",
    optimize=0,
    fast_web_view=9999999,
    force_ocr=True
)

I have tried to narrow down the issue, and the point at which it crashes seems to vary. Often it happens when Tesseract is called as a subprocess, but it also happens when OCRmyPDF attempts to save the PDF to a temporary folder.

When I execute the code through the debugger, it seems to work without errors.

Any suggestions would be greatly appreciated!
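
One way to narrow down a crash like this is to run the OCR step in a separate child process, so that a segmentation fault ends only the child and not the notebook's Python process. The following is a minimal sketch, not a confirmed fix: the helper names run_isolated and ocr_command are hypothetical, and it assumes a POSIX system and that OCRmyPDF is installed for the same interpreter so it can be started with "python -m ocrmypdf". The command-line flags mirror the options used in the minimal example above.

```python
import signal
import subprocess
import sys


def run_isolated(cmd, timeout=600):
    """Run cmd in a child process and report how it ended.

    On POSIX, a negative return code means the child was killed by a
    signal (for example -11 for SIGSEGV), so the parent survives and
    can report what happened.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if proc.returncode < 0:
        sig_name = signal.Signals(-proc.returncode).name
        return {"ok": False, "signal": sig_name, "stderr": proc.stderr}
    return {"ok": proc.returncode == 0, "signal": None, "stderr": proc.stderr}


def ocr_command(input_pdf, output_pdf):
    """Build an OCRmyPDF command line (hypothetical helper).

    Assumes OCRmyPDF is installed for this interpreter.
    """
    return [
        sys.executable, "-m", "ocrmypdf",
        "--language", "dan",
        "--output-type", "pdf",
        "--optimize", "0",
        "--force-ocr",
        input_pdf, output_pdf,
    ]
```

Calling run_isolated(ocr_command('noocr.pdf', 'out.pdf')) would then return a dictionary whose "signal" entry is "SIGSEGV" if the child segfaults, together with whatever the child wrote to stderr before it died.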

Re: OCRmyPDF in Databricks

Mykola_Melnyk — Tue, 15 Apr 2025 14:34:11 GMT

@kro You can try reading the PDF directly into a Spark DataFrame using the PDF DataSource. It extracts text from both digital and scanned PDFs.

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .load("path to the pdf file(s)")

Re: OCRmyPDF in Databricks

sridharplv — Tue, 15 Apr 2025 16:44:51 GMT

Refer to this link too: https://community.databricks.com/t5/data-engineering/pdf-parsing-in-notebook/td-p/14636