Databricks Community

kro · ‎10-21-2024

Hello,

Do any of you have experience with using OCRmyPDF in Databricks? I have tried to install it in various was with different versions, but my notebook keep crashing with the error:

The Python process exited with exit code 139 (SIGSEGV: Segmentation fault).

I have made a minimal code example to test it:

import ocrmypdf
import io

with open('noocr.pdf', 'rb') as f:
    pdf = f.read()

pdf_bytes = io.BytesIO(pdf)
out_bytes = io.BytesIO()

ocrmypdf.ocr(
    pdf_bytes,
    out_bytes,
    language='dan',
    output_type="pdf",
    optimize=0,
    fast_web_view=9999999,
    force_ocr=True
    )

I have tried to narrow down the issue and it seems to vary when it crashes. Often, it happens when Tesseract is called as a subprocess, but also when it attempts to save the PDF to a temporary folder.

While debugging, it seems to work without errors when executing the code through the debugger.

Any suggestions would be greatly appreciated!

Mykola_Melnyk · ‎04-15-2025

@kro You can try to read pdf directly to the spark DataFrame using PDF DataSource . It extract text for digital and scanned pdfs.

df = spark.read.format("pdf") \
.option("imageType", "BINARY") \
.option("resolution", "200") \
.option("pagePerPartition", "2") \
.option("reader", "pdfBox") \
.load("path to the pdf file(s)")

root
 |-- path: string (nullable = true)
 |-- filename: string (nullable = true)
 |-- page_number: integer (nullable = true)
 |-- partition_number: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- image: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- resolution: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |    |-- imageType: string (nullable = true)
 |    |-- exception: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |-- document: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- text: string (nullable = true)
 |    |-- outputType: string (nullable = true)
 |    |-- bBoxes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |    |    |-- score: float (nullable = true)
 |    |    |    |-- x: integer (nullable = true)
 |    |    |    |-- y: integer (nullable = true)
 |    |    |    |-- width: integer (nullable = true)
 |    |    |    |-- height: integer (nullable = true)
 |    |-- exception: string (nullable = true)

I'm developing document processing using Spark.