OCRmyPDF in Databricks
10-21-2024 01:34 AM
Hello,
Do any of you have experience with using OCRmyPDF in Databricks? I have tried to install it in various ways and with different versions, but my notebook keeps crashing with the error:
The Python process exited with exit code 139 (SIGSEGV: Segmentation fault).
I have made a minimal code example to test it:
import ocrmypdf
import io

with open('noocr.pdf', 'rb') as f:
    pdf = f.read()
pdf_bytes = io.BytesIO(pdf)
out_bytes = io.BytesIO()
ocrmypdf.ocr(
    pdf_bytes,
    out_bytes,
    language='dan',
    output_type="pdf",
    optimize=0,
    fast_web_view=9999999,
    force_ocr=True
)
I have tried to narrow down the issue, and the point at which it crashes seems to vary. Often it happens when Tesseract is called as a subprocess, but it also happens when OCRmyPDF attempts to save the PDF to a temporary folder.
Strangely, the code seems to run without errors when I execute it through the debugger.
Any suggestions would be greatly appreciated!
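A note on the error above: exit code 139 is 128 + 11, and signal 11 is SIGSEGV. One way to narrow down a crash like this is to run the OCR step in a child Python interpreter, so that a segfault ends only the child and leaves a traceback behind. The following is a minimal sketch using only the standard library; `describe_exit` and `run_isolated` are hypothetical helper names, and nothing here is confirmed to fix the Databricks crash.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a process exit status into a short readable description."""
    # subprocess reports death by signal as a negative number (-11),
    # while shells and notebooks usually report 128 + signal (139).
    signum = -returncode if returncode < 0 else returncode - 128
    try:
        return f"killed by {signal.Signals(signum).name}"
    except ValueError:
        # Not a valid signal number, so it was an ordinary exit code.
        return f"exited with code {returncode}"


def run_isolated(code, timeout=600.0):
    """Run a Python snippet in a child interpreter and report how it ended."""
    # -X faulthandler makes the child print a Python traceback to stderr
    # if it receives SIGSEGV, which shows where the crash happened.
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return describe_exit(proc.returncode), proc.stderr
```

The OCR call from the question could be passed in as the snippet, for example `run_isolated("import ocrmypdf; ocrmypdf.ocr('noocr.pdf', 'out.pdf', language=['dan'], force_ocr=True)")`, assuming file paths are acceptable in place of the in-memory buffers. If the child is killed, the returned stderr should contain the faulthandler traceback.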
04-15-2025 07:34 AM
@kro You can try reading the PDF directly into a Spark DataFrame using the PDF DataSource. It extracts text from both digital and scanned PDFs.
df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .load("path to the pdf file(s)")
root
 |-- path: string (nullable = true)
 |-- filename: string (nullable = true)
 |-- page_number: integer (nullable = true)
 |-- partition_number: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- image: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- resolution: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |    |-- imageType: string (nullable = true)
 |    |-- exception: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |-- document: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- text: string (nullable = true)
 |    |-- outputType: string (nullable = true)
 |    |-- bBoxes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |    |    |-- score: float (nullable = true)
 |    |    |    |-- x: integer (nullable = true)
 |    |    |    |-- y: integer (nullable = true)
 |    |    |    |-- width: integer (nullable = true)
 |    |    |    |-- height: integer (nullable = true)
 |    |-- exception: string (nullable = true)
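Since the schema above has one row per page, the page texts usually need to be stitched back together per file. Here is a rough sketch of that step in plain Python; `join_pages` is a hypothetical helper, and it assumes the rows have already been brought to the driver (for example with `df.select("filename", "page_number", "text").collect()`), which is only sensible for a small number of documents.

```python
from collections import defaultdict


def join_pages(rows):
    """Group per-page rows by filename and join their text in page order."""
    pages = defaultdict(list)
    for row in rows:
        # Rows are indexed by column name; this works for Spark Row objects
        # as well as plain dicts. The text column is nullable, so fall back
        # to an empty string for pages with no extracted text.
        pages[row["filename"]].append((row["page_number"], row["text"] or ""))
    return {
        name: "\n".join(text for _, text in sorted(items, key=lambda p: p[0]))
        for name, items in pages.items()
    }
```

For larger collections, the same grouping would be better done inside Spark rather than on the driver.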
04-15-2025 09:44 AM
See also this thread: https://community.databricks.com/t5/data-engineering/pdf-parsing-in-notebook/td-p/14636