08-16-2023 12:30 PM
Hi,
I'm trying to install the system-level package poppler-utils on the cluster. I added the following line to the init.sh script:
sudo apt-get -f -y install poppler-utils
I got the following error: PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
If I run the same line at the notebook level, I don't get this error.
Can anyone help me with this issue and explain how to install system-level packages at the cluster level with init scripts?
01-21-2025 08:08 PM
Hi Team,
If you use a single user cluster and use the below init script, it will work:
sudo rm -r /var/lib/apt/lists/*
sudo apt clean && sudo apt update --fix-missing -y
sudo apt-get install poppler-utils tesseract-ocr -y
However, this solution will not work on a shared cluster.
RCA:
Libraries installed via init scripts are not available to user-defined functions (UDFs) on Shared mode clusters, because UDFs execute in a sandboxed environment. Python UDFs therefore cannot use dependencies installed via init scripts.
An init script is needed for Poppler in your case because your code (for example, pdf2image) relies on Poppler's command-line utilities.
Solution:
For a single user cluster: Keep using the init script above.
For a shared cluster: Consider alternative Python libraries that provide similar functionality to poppler-utils. Two such libraries are pdfplumber and PyMuPDF.
PyMuPDF: Screenshot 1
Pdfplumber: Screenshot 2
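Since the screenshots may not be visible, here is a rough sketch of what using these two libraries can look like. It assumes both are installed as Python packages (for example with %pip install pymupdf pdfplumber), and the function names are hypothetical helpers written for this illustration, not part of either library. Treat it as a starting point rather than a drop-in replacement for pdf2image.

```python
def page_count_pymupdf(pdf_path):
    """Count pages with PyMuPDF (imported under the name fitz)."""
    import fitz  # provided by the PyMuPDF package

    with fitz.open(pdf_path) as doc:
        return doc.page_count


def render_first_page_pymupdf(pdf_path, png_path, dpi=150):
    """Render the first page to a PNG file without calling Poppler."""
    import fitz

    with fitz.open(pdf_path) as doc:
        pixmap = doc[0].get_pixmap(dpi=dpi)
        pixmap.save(png_path)
    return png_path


def page_count_pdfplumber(pdf_path):
    """Count pages with pdfplumber."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return len(pdf.pages)


def first_page_text_pdfplumber(pdf_path):
    """Extract the text of the first page with pdfplumber."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return pdf.pages[0].extract_text()
```

The imports sit inside the functions so that the cell can be defined even when only one of the two libraries is installed. Pass your own file path to the helpers; none is assumed here.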
Hope this helps you!
08-18-2023 09:36 AM
Hi Kaniz, I tried to include it in the init script, but it still shows the same error. The path I gave is "usr/bin". How can I navigate to this path to check whether my package is installed? I would also like to know how to navigate to databricks/bin/python and how to check the environment variables.
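For anyone with the same questions, a notebook cell along these lines is one way to look at them. It is only a sketch using the Python standard library: describe_environment is a hypothetical helper name, and pdfinfo is used as the tool to look for because it is the Poppler binary that the error message refers to.

```python
import os
import shutil
import sys


def describe_environment(tool="pdfinfo"):
    """Collect basic facts about where a command-line tool and Python live."""
    return {
        # Full path of the tool if it is found on PATH, otherwise None.
        "tool_path": shutil.which(tool),
        # Directories searched for executables, in search order.
        "path_entries": os.environ.get("PATH", "").split(os.pathsep),
        # The Python interpreter that is running this cell.
        "python_executable": sys.executable,
    }


info = describe_environment("pdfinfo")
for key, value in info.items():
    print(f"{key}: {value}")
```

If tool_path prints None, the binary is not visible to the notebook's Python process, which matches the "Is poppler installed and in PATH?" error. Any other environment variable can be read the same way through os.environ.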
09-09-2024 05:07 AM
12-04-2024 02:26 PM
I am using my personal cluster but am still getting the same error.
01-15-2025 03:49 AM
Hi, the following worked for me. I created an init script for my compute with the code below:
sudo rm -r /var/lib/apt/lists/*
sudo apt clean && sudo apt update --fix-missing -y
sudo apt-get install poppler-utils tesseract-ocr -y
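After the cluster restarts with the init script, a quick check from a notebook can confirm that the tools are actually callable. The cell below is a minimal sketch: tool_version is a hypothetical helper, and the version flags (-v for pdfinfo, --version for tesseract) are the ones these tools commonly accept, so adjust them if your versions differ.

```python
import shutil
import subprocess


def tool_version(command, version_flag="-v"):
    """Return the version text a tool prints, or None if it is not on PATH."""
    if shutil.which(command) is None:
        return None
    result = subprocess.run(
        [command, version_flag],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some tools print version info to stderr, so combine both streams.
    return (result.stdout + result.stderr).strip()


print("pdfinfo:", tool_version("pdfinfo", "-v"))
print("tesseract:", tool_version("tesseract", "--version"))
```

A result of None for either tool suggests the init script did not run on that node or the install step failed.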