Raghavan93513
Databricks Employee

Hi Team,

On a single user cluster, the following init script works:

sudo rm -r /var/lib/apt/lists/*
sudo apt clean && sudo apt update --fix-missing -y
sudo apt-get install poppler-utils tesseract-ocr -y

On a shared cluster, however, this solution does not work.

RCA:

On Shared mode clusters, user-defined functions (UDFs) execute in a sandboxed environment, and libraries installed via init scripts are not available inside that sandbox. As a result, Python UDFs cannot use dependencies that were installed by an init script.

Poppler has to be installed through an init script because your code (for example, pdf2image) relies on Poppler’s command-line utilities rather than on a Python package alone.
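To confirm whether the Poppler binaries are visible where your code runs, a quick check like the one below may help. This is only a sketch: the helper name `missing_tools` is made up for illustration, and the tool list assumes pdf2image is shelling out to `pdftoppm` and `pdfinfo`.

```python
import shutil

# Assumption: these are the Poppler command-line tools pdf2image calls.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_tools(tools=POPPLER_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


# An empty list means every tool was found in this environment.
print("Missing Poppler tools:", missing_tools())
```

Running it in a notebook cell checks the driver. Calling `missing_tools()` from inside a Python UDF would check the environment the UDF actually executes in, which is where the difference between the two cluster modes should show up.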

Solution:

For single user clusters: Keep using the init script above.
For shared mode clusters: Consider alternative Python libraries that provide functionality similar to poppler-utils without needing system binaries. Two such libraries are pdfplumber and PyMuPDF.

PyMuPDF: Screenshot 1
pdfplumber: Screenshot 2
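As a rough illustration of how the two libraries can stand in for the Poppler-based calls, here is a minimal sketch. It assumes both packages are installed as Python libraries (for example with pip) and that the PDF is already available as bytes. The function names `render_pages_png` and `extract_text` are hypothetical, not part of either library. Note that this covers page rendering and text extraction from PDFs with a text layer; it is not a replacement for tesseract-ocr.

```python
import io


def render_pages_png(pdf_bytes, dpi=150):
    """Render each page to PNG bytes with PyMuPDF (similar in spirit to pdf2image)."""
    import fitz  # PyMuPDF; imported here so the function can be shipped to a UDF

    zoom = dpi / 72  # PDF user space is 72 points per inch
    images = []
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            images.append(pix.tobytes("png"))
    return images


def extract_text(pdf_bytes):
    """Extract the text of every page with pdfplumber, joined by newlines."""
    import pdfplumber

    with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
        # extract_text() can return None for pages without text
        return "\n".join((page.extract_text() or "") for page in pdf.pages)
```

Either function can then be wrapped in a Python UDF that takes a binary column; the imports are placed inside the functions so they are resolved where the UDF runs.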

Hope this helps you!

 
