Databricks Community

Deloitte_DS · ‎08-16-2023

Hi,

I'm trying to install the system-level package poppler-utils on the cluster. I added the following line to the init.sh script:

sudo apt-get -f -y install poppler-utils

I got the following error: PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

If I run the same command at the notebook level, I don't get this error.

Can anyone help me with this issue and explain how to install system-level packages at the cluster level using init scripts?

Raghavan93513 · ‎01-21-2025

Hi Team,

If you use a single-user cluster with the init script below, it will work:

sudo rm -r /var/lib/apt/lists/*
sudo apt clean && sudo apt update --fix-missing -y
sudo apt-get install poppler-utils tesseract-ocr -y

But if you are using a shared cluster, this solution will not work.

RCA:

Libraries installed via init scripts are not available to user-defined functions (UDFs) on Shared mode clusters, because UDFs execute in a sandboxed environment. Python UDFs therefore cannot use dependencies installed via init scripts.

We need init scripts for Poppler, especially in your case, as your code (for example, pdf2image) relies on Poppler’s command-line utilities.

Solution:

For a single-user cluster: keep using the init script approach shown above.
For a shared mode cluster: consider alternative Python libraries that provide similar functionality to poppler-utils. Two such libraries are pdfplumber and PyMuPDF.

PyMuPDF: Screenshot 1
Pdfplumber: Screenshot 2
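
As a rough sketch of the alternative suggested above, page counting can be done in pure Python without calling Poppler's pdfinfo. This assumes PyMuPDF or pdfplumber is already installed as a Python library on the cluster; the helper name count_pdf_pages and the example path are hypothetical and not from the original post.

```python
def count_pdf_pages(pdf_path):
    """Return the number of pages in a PDF without shelling out to Poppler.

    Tries PyMuPDF first and falls back to pdfplumber (assumption: at least
    one of the two is installed in the Python environment).
    """
    try:
        import fitz  # PyMuPDF's import name
    except ImportError:
        fitz = None

    if fitz is not None:
        # Document objects support the context-manager protocol
        with fitz.open(pdf_path) as doc:
            return doc.page_count

    import pdfplumber  # raises ImportError if neither library is available

    with pdfplumber.open(pdf_path) as pdf:
        return len(pdf.pages)


# Hypothetical usage; replace the path with a real PDF on your cluster:
# print(count_pdf_pages("/dbfs/tmp/example.pdf"))
```

Because neither library depends on the pdfinfo binary, this path avoids PDFInfoNotInstalledError, though whether it fits a given workload on a shared cluster would still need to be verified.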

Hope this helps you!


Deloitte_DS · ‎08-18-2023

Hi Kaniz, I tried to include it in the init script, but it still shows the same error. The path I gave is "usr/bin". How can I navigate to this path to check whether my package is installed? I would also like to know how to navigate to databricks/bin/python and how to check the environment variables.
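
One possible way to answer these questions from a notebook cell is sketched below. It assumes pdf2image looks for the pdfinfo and pdftoppm binaries (both shipped in poppler-utils) on PATH; the helper name find_poppler_tools is made up for this illustration.

```python
import os
import shutil
import sys


def find_poppler_tools(tools=("pdfinfo", "pdftoppm")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


# Show where (or whether) each Poppler binary resolves for this Python process
for tool, path in find_poppler_tools().items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")

# Show which Python interpreter the notebook is actually running
print("python:", sys.executable)

# List the PATH entries visible to this process, one per line
for entry in os.environ.get("PATH", "").split(os.pathsep):
    print("PATH entry:", entry)
```

On Ubuntu, apt normally places these binaries in /usr/bin (with a leading slash), so a result of None here suggests the init script did not install the package on the node running the code.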

dheeraj-cir · ‎09-09-2024

Use a personal cluster and run

!sudo apt-get update

and

!sudo apt-get install -y poppler-utils

Arunraja · ‎12-04-2024

I am using my personal cluster but am still getting the same error.

kbmv · ‎01-15-2025

Hi, the following worked for me. I created an init script for my compute with the code below:

sudo rm -r /var/lib/apt/lists/*
sudo apt clean && sudo apt update --fix-missing -y
sudo apt-get install poppler-utils tesseract-ocr -y
