Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.
OCRmyPDF in Databricks

kro
New Contributor

Hello,

Do any of you have experience with using OCRmyPDF in Databricks? I have tried to install it in various ways and with different versions, but my notebook keeps crashing with the error:

The Python process exited with exit code 139 (SIGSEGV: Segmentation fault).

I have made a minimal code example to test it:

import ocrmypdf
import io

with open('noocr.pdf', 'rb') as f:
    pdf = f.read()

pdf_bytes = io.BytesIO(pdf)
out_bytes = io.BytesIO()

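ocrmypdf.ocr(
    pdf_bytes,                 # input PDF as a file-like object
    out_bytes,                 # output is written to this in-memory buffer
    language='dan',            # Danish; needs the Tesseract 'dan' language data
    output_type="pdf",         # plain PDF instead of PDF/A
    optimize=0,                # skip the optimization step
    fast_web_view=9999999,     # size threshold in MB; this high, linearization never runs
    force_ocr=True             # rasterize and OCR every page, even ones with text
)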

I have tried to narrow down the issue, and the point at which it crashes seems to vary. Often it happens when Tesseract is called as a subprocess, but it also happens when OCRmyPDF attempts to save the PDF to a temporary folder.
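
A minimal sketch of how a crash like this could be isolated further, assuming only the Python standard library. It runs a snippet in a separate interpreter with `faulthandler` enabled, so that a segmentation fault kills the child process rather than the notebook and a Python-level traceback is written to stderr. The helper names `describe_exit` and `run_isolated` are hypothetical and are not part of OCRmyPDF or Databricks; this is a diagnostic aid, not a confirmed fix.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a process exit status into a readable description."""
    # subprocess reports death-by-signal as a negative number (-11 for SIGSEGV);
    # shells report the same event as 128 + signal number (139 for SIGSEGV).
    if returncode < 0:
        signum = -returncode
    elif returncode > 128:
        signum = returncode - 128
    else:
        return f"exited with code {returncode}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = f"signal {signum}"
    return f"killed by {name}"


def run_isolated(code: str, timeout: float = 600.0):
    """Run a Python snippet in a child interpreter with faulthandler enabled."""
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # stderr holds the faulthandler traceback if the child crashed.
    return describe_exit(proc.returncode), proc.stderr
```

The OCR call would be passed to `run_isolated` as a string of source code; if the child segfaults, the returned stderr text should show which Python frame was active at the time.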

When I run the same code through the debugger, it seems to work without errors.

Any suggestions would be greatly appreciated!
