Data Engineering

Unable to install poppler-utils

Deloitte_DS
New Contributor II

Hi,

I'm trying to install the system-level package poppler-utils on the cluster. I added the following line to the init.sh script:

sudo apt-get -f -y install poppler-utils

I got the following error: PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

If I run the same line at the notebook level, I don't get this error.

Can anyone help me with this issue and explain how to install system-level packages at the cluster level in init scripts?

1 ACCEPTED SOLUTION


Raghavan93513
Databricks Employee

Hi Team,

On a single user cluster, the following init script works:

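# Remove cached package lists so stale indexes do not break the install
sudo rm -r /var/lib/apt/lists/*
# Clear the package cache and refresh the package index
sudo apt clean && sudo apt update --fix-missing -y
# Install Poppler's command-line tools (and Tesseract OCR)
sudo apt-get install poppler-utils tesseract-ocr -y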

If you are using a shared cluster, however, this solution will not work.

RCA:

Libraries installed via init scripts are not available to user-defined functions (UDFs) on Shared mode clusters, because UDFs execute in a sandboxed environment. Python UDFs therefore cannot use dependencies installed via init scripts.

Poppler has to be installed at the system level (for example, through an init script) because your code (for example, pdf2image) relies on Poppler's command-line utilities.

Solution:

For single user clusters: Keep using the init script approach.
For shared mode clusters: Consider alternative Python libraries that provide functionality similar to poppler-utils. Two such libraries are pdfplumber and PyMuPDF.

PyMuPDF: Screenshot 1
Pdfplumber: Screenshot 2
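
Since the screenshots are not reproduced here, the following is a minimal sketch of how the two libraries might stand in for the Poppler-backed calls. It assumes PyMuPDF and pdfplumber have been installed with pip; the function names `count_pages` and `render_page_png` are hypothetical helpers made up for this illustration, and this is not necessarily what the original screenshots showed.

```python
def count_pages(pdf_path: str, backend: str = "pymupdf") -> int:
    """Return the page count of a PDF without calling Poppler's pdfinfo."""
    if backend == "pymupdf":
        import fitz  # PyMuPDF (assumed installed: pip install pymupdf)

        with fitz.open(pdf_path) as doc:
            return doc.page_count
    if backend == "pdfplumber":
        import pdfplumber  # assumed installed: pip install pdfplumber

        with pdfplumber.open(pdf_path) as pdf:
            return len(pdf.pages)
    raise ValueError(f"unknown backend: {backend!r}")


def render_page_png(pdf_path: str, page_number: int = 0, dpi: int = 200) -> bytes:
    """Render one page to PNG bytes with PyMuPDF, roughly what pdf2image does via Poppler."""
    import fitz  # PyMuPDF (assumed installed: pip install pymupdf)

    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_number)  # zero-based page index
        pixmap = page.get_pixmap(dpi=dpi)
        return pixmap.tobytes("png")
```

Both helpers import the library lazily, so only the backend you actually use needs to be installed. For example, `count_pages("/path/to/file.pdf", backend="pdfplumber")` would use pdfplumber only; the path here is a placeholder.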

Hope this helps you!

 


5 REPLIES

Hi Kaniz, I tried including it in the init script, but it still shows the same error. The path I gave is "usr/bin". How can I navigate to this path to check whether my package is installed? I would also like to know how to navigate to databricks/bin/python and how to check the environment variables.
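
One possible way to answer these questions from a notebook cell is sketched below using only the Python standard library. It assumes the check runs on the driver node, so it says nothing about workers or sandboxed UDFs; the helper names `find_binary`, `binary_version`, and `path_entries` are hypothetical and chosen only for this example.

```python
import os
import shutil
import subprocess
import sys


def find_binary(name: str = "pdfinfo"):
    """Return the full path of an executable found on PATH, or None if it is missing."""
    return shutil.which(name)


def binary_version(name: str = "pdfinfo"):
    """Return the version banner of the executable, or None if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        return None
    result = subprocess.run([path, "-v"], capture_output=True, text=True)
    # Use whichever stream carries the banner.
    return (result.stderr or result.stdout).strip()


def path_entries():
    """Return the directories listed in the PATH environment variable."""
    return [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]


if __name__ == "__main__":
    print("pdfinfo found at:", find_binary("pdfinfo"))
    print("pdfinfo version:", binary_version("pdfinfo"))
    print("Python interpreter:", sys.executable)
    print("PATH entries:", path_entries())
```

If `find_binary("pdfinfo")` returns None, the interpreter cannot see Poppler on its PATH, which matches the PDFInfoNotInstalledError message. `sys.executable` shows which Python the notebook is actually running.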

dheeraj-cir
New Contributor II
Use a personal cluster and run:

!sudo apt-get update

and then:

!sudo apt-get install -y poppler-utils

I am using my personal cluster, but I am still getting the same error.

kbmv
Contributor

Hi, the following worked for me. I created an init script for my compute with the code below:

sudo rm -r /var/lib/apt/lists/*
sudo apt clean && sudo apt update --fix-missing -y
sudo apt-get install poppler-utils tesseract-ocr -y
