10-20-2021 02:30 AM
Trying to use pdf2image on Databricks, but it's failing with "PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?"
I've installed pdf2image & poppler-utils by running the following in a cell:
%pip install pdf2image
%pip install poppler-utils
But I'm still hitting this error.
Can anyone help with my next step, please?
(Rather annoyingly, this question was asked before at https://forums.databricks.com/questions/62529/pdf-to-image-using-poppler.html, which I can see in Google search results, but the Databricks community forum seems to have changed format. If anyone knows how to translate the old forum URLs to the new ones, that would also be appreciated!)
Thanks
10-20-2021 02:34 AM
PS: I've also tried installing pdf2image & poppler-utils as libraries on the cluster, but I'm still hitting the same issue.
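(For anyone hitting this later: pdf2image is only a wrapper around the poppler command-line tools, and %pip / cluster libraries install Python packages rather than OS binaries, so whatever %pip install poppler-utils pulls from PyPI, it isn't the pdfinfo/pdftoppm binaries pdf2image shells out to. A minimal sketch of the system-level install, run from a notebook cell; on a multi-node cluster this only covers the driver:)
%sh
sudo apt-get update && sudo apt-get install -y poppler-utils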
10-20-2021 03:01 AM
@Kaniz Fatma, would you be able to import the forum content here?
https://forums.databricks.com/questions/62529/pdf-to-image-using-poppler.html
10-20-2021 03:23 AM
There were no replies on https://forums.databricks.com/questions/62529/pdf-to-image-using-poppler.html (as far as I can see in the cached copy).
10-20-2021 03:30 AM
Try modifying the poppler_path option. Try /usr/bin or /usr/local/bin, or leave it unset so the system PATH is searched. If that doesn't work, check the cluster environment variables to see where poppler is installed. Example:
convert_from_path(file, 200, poppler_path='/usr/bin')
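If apt's poppler-utils is installed, a minimal sketch that locates the binaries instead of guessing the directory (reusing the same file variable as above; shutil.which finds pdfinfo on the PATH):
import os, shutil
from pdf2image import convert_from_path

# Find the directory holding pdfinfo (typically /usr/bin) and pass it explicitly.
# Assumes poppler-utils is already installed; shutil.which returns None otherwise.
poppler_dir = os.path.dirname(shutil.which("pdfinfo"))
pages = convert_from_path(file, 200, poppler_path=poppler_dir)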
10-20-2021 07:08 AM
Thanks for the suggestion, HubertDudek. Unfortunately, after a few hours of attempting to get this running with your path suggestion, I've given up and moved the PDF-to-PNG conversion to another part of the data pipeline.
10-26-2022 06:33 PM
I couldn't figure out how to get pdf2image working either, so instead I installed ImageMagick, converted the PDF to PNG, and read it in with matplotlib:
%sh
# Download the portable ImageMagick binary and make it executable
cd /opt
wget https://imagemagick.org/archive/binaries/magick
chmod +x magick

%sh
# Convert the PDF to PNG ($vcfFiles and $sample_id are set elsewhere)
/opt/magick convert $vcfFiles/$sample_id.pdf $vcfFiles/$sample_id.png
import matplotlib.pyplot as plt
import matplotlib.image as img

# Read the converted PNG back in and render it inline
im = img.imread(vcfFiles + "/" + sample_id + ".png")
plt.imshow(im)
plt.show()
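One caveat with the ImageMagick route: PDF rasterization is usually delegated to Ghostscript, so if the convert step complains about a missing delegate you may also need sudo apt-get install -y ghostscript in a %sh cell, and some ImageMagick builds ship a security policy (policy.xml) that blocks PDF input by default.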
cheers
03-27-2024 12:34 PM
Seems like this thread has died, but for posterity, Databricks provides the following code for installing poppler on a cluster. The code is sourced from the dbdemos accelerators, specifically the "LLM Chatbot With Retrieval Augmented Generation (RAG) and Llama 2 70B" demo (https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#). In the 01-PDF-Advanced-Data-Preparation notebook there's code to remote-execute the 00-init-advanced notebook, and in that notebook you'll find the code below.
# Install poppler on the cluster (should be done by init scripts)
def install_ocr_on_nodes():
    """
    Install poppler on the cluster (should be done by init scripts).
    """
    import subprocess
    num_workers = max(1, int(spark.conf.get("spark.databricks.clusterUsageTags.clusterWorkers")))
    command = "sudo rm -rf /var/cache/apt/archives/* /var/lib/apt/lists/* && sudo apt-get clean && sudo apt-get update && sudo apt-get install poppler-utils tesseract-ocr -y"

    def run_subprocess(command):
        try:
            output = subprocess.check_output(command, stderr=subprocess.STDOUT, shell=True)
            return output.decode()
        except subprocess.CalledProcessError as e:
            raise Exception("An error occurred installing OCR libs:" + e.output.decode())

    # Install on the driver first
    run_subprocess(command)

    def run_command(iterator):
        for x in iterator:
            yield run_subprocess(command)

    data = spark.sparkContext.parallelize(range(num_workers), num_workers)
    # Use mapPartitions to run the command in each partition (i.e. on each worker)
    output = data.mapPartitions(run_command)
    try:
        output.collect()
        print("OCR libraries installed")
    except Exception as e:
        print(f"Couldn't install on all nodes: {e}")
        raise e
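With the helper defined, usage is just one call followed by the normal pdf2image API. A minimal sketch (the PDF path here is hypothetical):
install_ocr_on_nodes()  # installs poppler-utils and tesseract-ocr on the driver and workers

from pdf2image import convert_from_path
pages = convert_from_path("/dbfs/tmp/sample.pdf", dpi=200)  # pdfinfo is now on the PATH
pages[0].save("/dbfs/tmp/sample_page1.png", "PNG")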