Hello,
Do any of you have experience with using OCRmyPDF in Databricks? I have tried installing it in various ways and with different versions, but my notebook keeps crashing with the error:
The Python process exited with exit code 139 (SIGSEGV: Segmentation fault).
I have made a minimal code example to test it:
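import io

import ocrmypdf

# Read the input PDF (no text layer yet) into memory
with open('noocr.pdf', 'rb') as f:
    pdf = f.read()

# OCRmyPDF reads from and writes to in-memory streams here
pdf_bytes = io.BytesIO(pdf)
out_bytes = io.BytesIO()

ocrmypdf.ocr(
    pdf_bytes,
    out_bytes,
    language='dan',          # Danish Tesseract language pack
    output_type='pdf',       # plain PDF output rather than PDF/A
    optimize=0,              # skip the optimization step
    fast_web_view=9999999,   # threshold in MB, set high so linearization never runs
    force_ocr=True,          # rasterize and OCR every page
)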
I have tried to narrow down the issue, but the point at which it crashes varies. Often it happens when Tesseract is called as a subprocess, but sometimes it happens when OCRmyPDF attempts to save the PDF to a temporary folder.
Oddly, the code runs without errors when I execute it through the debugger.
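To get more detail than the bare exit code 139, one thing I am considering is running the OCR call in a separate Python process with the standard-library faulthandler enabled, so that a segfault kills only the child and leaves a Python traceback on stderr. This is only a rough sketch: run_isolated and OCR_SNIPPET are names I made up, the file paths are placeholders, and I am assuming sys.executable on the cluster points at the same interpreter the notebook uses.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 600.0) -> subprocess.CompletedProcess:
    """Run `code` in a fresh Python process with faulthandler enabled.

    Hypothetical helper: if the child segfaults, the notebook process
    survives and faulthandler writes a traceback to the child's stderr.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


# Placeholder snippet: same OCR call as above, but using file paths.
OCR_SNIPPET = """
import ocrmypdf
ocrmypdf.ocr('noocr.pdf', 'out.pdf', language='dan', force_ocr=True)
"""

# Uncomment to try it on the cluster:
# result = run_isolated(OCR_SNIPPET)
# print(result.returncode)   # negative value means the child was killed by a signal
# print(result.stderr)       # faulthandler traceback, if any
```

On Linux, a negative return code is the number of the signal that killed the child, so a segfault should show up as -11 together with whatever traceback faulthandler managed to write.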
Any suggestions would be greatly appreciated!