
OCRmyPDF in Databricks

kro
New Contributor II

Hello,

Do any of you have experience with using OCRmyPDF in Databricks? I have tried installing it in various ways and with different versions, but my notebook keeps crashing with the error:

The Python process exited with exit code 139 (SIGSEGV: Segmentation fault).

I have made a minimal code example to test it:

import io

import ocrmypdf

# Read the un-OCRed input PDF into memory.
with open('noocr.pdf', 'rb') as f:
    pdf = f.read()

pdf_bytes = io.BytesIO(pdf)  # input stream
out_bytes = io.BytesIO()     # receives the OCRed PDF

ocrmypdf.ocr(
    pdf_bytes,
    out_bytes,
    language='dan',          # Danish Tesseract language data
    output_type="pdf",
    optimize=0,
    fast_web_view=9999999,
    force_ocr=True,
)

I have tried to narrow down the issue, but the point at which it crashes varies. Often it happens when Tesseract is called as a subprocess, but sometimes also when the PDF is being saved to a temporary folder.
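
One way to narrow down where the segfault occurs is Python's standard `faulthandler` module, which dumps the Python traceback of every thread when the process receives SIGSEGV. The following is only a sketch: the log path is a hypothetical choice, and it reports crashes in the Python process itself, not inside the Tesseract child process.

```python
import faulthandler
import os
import tempfile

# Hypothetical log location; any writable local path on the driver would do.
CRASH_LOG = os.path.join(tempfile.gettempdir(), "ocrmypdf_crash.log")

# Keep the file object open for the life of the process: faulthandler
# writes to its file descriptor at crash time.
crash_log = open(CRASH_LOG, "w")

# On SIGSEGV (and other fatal signals), dump the traceback of all threads.
faulthandler.enable(file=crash_log, all_threads=True)

# ... run the ocrmypdf.ocr(...) call here, then inspect CRASH_LOG after a crash ...
```

A real file is used instead of `sys.stderr` because notebook environments may replace the standard streams with objects that lack a usable file descriptor.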

Oddly, the code seems to run without errors when I execute it through the debugger.
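
Because the crash takes the whole notebook process down, it may help to run OCRmyPDF in a separate child process so that a segfault only kills the child. A minimal sketch, assuming the OCRmyPDF command-line flags below mirror the keyword arguments used above (worth checking against `ocrmypdf --help` for the installed version); `build_ocr_command` and `run_isolated` are hypothetical helper names.

```python
import signal
import subprocess
import sys


def build_ocr_command(src, dst, language="dan"):
    """Build an OCRmyPDF CLI call mirroring the options of the snippet above."""
    return [
        sys.executable, "-m", "ocrmypdf",
        "-l", language,
        "--output-type", "pdf",
        "--optimize", "0",
        "--fast-web-view", "9999999",
        "--force-ocr",
        src, dst,
    ]


def run_isolated(cmd):
    """Run cmd in a child process and return (returncode, description)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        name = signal.Signals(-proc.returncode).name
        return proc.returncode, f"child killed by {name}"
    if proc.returncode != 0:
        return proc.returncode, proc.stderr.strip()
    return 0, "ok"


# Example call (requires ocrmypdf and the input file to be present):
# code, message = run_isolated(build_ocr_command("noocr.pdf", "ocr.pdf"))
```

If the child still dies with SIGSEGV, the notebook survives and the return code and stderr can be inspected; if it succeeds, that points toward an interaction with the notebook process rather than OCRmyPDF itself.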

Any suggestions would be greatly appreciated!

2 REPLIES

Mykola_Melnyk
New Contributor III

@kro You can try reading the PDF directly into a Spark DataFrame using the PDF DataSource. It extracts text from both digital and scanned PDFs.

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .load("path to the pdf file(s)")

The resulting DataFrame has the following schema:

root
 |-- path: string (nullable = true)
 |-- filename: string (nullable = true)
 |-- page_number: integer (nullable = true)
 |-- partition_number: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- image: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- resolution: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |    |-- imageType: string (nullable = true)
 |    |-- exception: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |-- document: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- text: string (nullable = true)
 |    |-- outputType: string (nullable = true)
 |    |-- bBoxes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |    |    |-- score: float (nullable = true)
 |    |    |    |-- x: integer (nullable = true)
 |    |    |    |-- y: integer (nullable = true)
 |    |    |    |-- width: integer (nullable = true)
 |    |    |    |-- height: integer (nullable = true)
 |    |-- exception: string (nullable = true)
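
Since the schema above has one row per page (`page_number`), the page texts need to be stitched together to get one string per document. A minimal sketch in plain Python, assuming the rows were first collected with something like `df.select("filename", "page_number", "text").collect()` and that `page_number` is never null; `stitch_pages` is a hypothetical helper, not part of the data source.

```python
from collections import defaultdict


def stitch_pages(rows):
    """Group (filename, page_number, text) rows into one string per file."""
    pages = defaultdict(list)
    for filename, page_number, text in rows:
        # `text` is nullable in the schema, so treat None as an empty page.
        pages[filename].append((page_number, text or ""))
    return {
        name: "\n".join(text for _, text in sorted(items, key=lambda p: p[0]))
        for name, items in pages.items()
    }
```

Collecting to the driver is only reasonable for a small number of documents; for larger volumes the same grouping would be better expressed as a Spark aggregation.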
I'm developing document processing using Spark.

sridharplv
Contributor
