Hello,
Do any of you have experience with using OCRmyPDF in Databricks? I have tried installing it in various ways and with different versions, but my notebook keeps crashing with the error:
The Python process exited with exit code 139 (SIGSEGV: Segmentation fault).
I have made a minimal code example to test it:
import io

import ocrmypdf

# Read the input PDF into memory
with open('noocr.pdf', 'rb') as f:
    pdf = f.read()

pdf_bytes = io.BytesIO(pdf)
out_bytes = io.BytesIO()

# OCR from one in-memory stream to another
ocrmypdf.ocr(
    pdf_bytes,
    out_bytes,
    language='dan',
    output_type="pdf",
    optimize=0,
    fast_web_view=9999999,
    force_ocr=True,
)
I have tried to narrow down the issue, but the point at which it crashes varies. Often it happens when Tesseract is called as a subprocess, but sometimes it happens when OCRmyPDF attempts to save the PDF to a temporary folder.
Strangely, the code seems to run without errors when I execute it through the debugger.
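One way the crash location might be pinned down is to run the OCR call in a separate Python process with the standard-library faulthandler enabled, so the notebook kernel survives and the Python traceback at the moment of the segfault ends up in stderr. This is only a rough sketch: run_isolated, describe_exit and OCR_SCRIPT are hypothetical names made up for illustration, and it assumes sys.executable in the notebook points at the same environment where ocrmypdf is installed.

```python
import signal
import subprocess
import sys

# Hypothetical script text: the same OCR call as above, run in a child process.
OCR_SCRIPT = """
import io
import ocrmypdf

with open('noocr.pdf', 'rb') as f:
    pdf_bytes = io.BytesIO(f.read())
out_bytes = io.BytesIO()
ocrmypdf.ocr(pdf_bytes, out_bytes, language='dan', output_type='pdf',
             optimize=0, force_ocr=True)
"""


def run_isolated(code, timeout=600.0):
    """Run `code` in a child interpreter with faulthandler enabled.

    Returns (returncode, stderr). On POSIX, a negative return code means
    the child was terminated by that signal number.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def describe_exit(returncode):
    """Turn a subprocess return code into a short readable description."""
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with code " + str(returncode)


# Example usage (not executed here):
# rc, err = run_isolated(OCR_SCRIPT)
# print(describe_exit(rc))
# print(err)  # faulthandler's traceback, if the child crashed
```

If the child segfaults, stderr should contain the Python-level stack of each thread at the time of the crash, which may show whether it is always the same call that fails or whether it really moves around between runs.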
Any suggestions would be greatly appreciated!