Databricks

dg · 10-20-2021

Trying to use pdf2image on Databricks, but it's failing with "PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?"

I've installed pdf2image & poppler-utils by running the following in a cell:

%pip install pdf2image

%pip install poppler-utils

But still hitting this error.

Can anyone help with my next step please?

(Rather annoyingly, this question was asked before at this url https://forums.databricks.com/questions/62529/pdf-to-image-using-poppler.html which I can see in google search results, but the databricks community forum seems to have changed format. If anyone knows how to translate from the old forum urls to new that would also be appreciated!)

Thanks

dg · 10-20-2021

PS: I've also tried installing pdf2image & poppler-utils into the libraries on the cluster, but still hitting same issue

Prabakar · 10-20-2021

@Kaniz Fatma , would you be able to import the forum content here?

https://forums.databricks.com/questions/62529/pdf-to-image-using-poppler.html

Hubert-Dudek · 10-20-2021

There were no replies on https://forums.databricks.com/questions/62529/pdf-to-image-using-poppler.html (I see in cache)

Hubert-Dudek · 10-20-2021

Try modifying the poppler_path option. Try /usr/bin or /usr/local/bin, or just a space " ". If that doesn't work, please check the cluster environment variables to see where poppler is installed. Example:

convert_from_path(file, 200, poppler_path='/usr/bin')
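Rather than guessing the directory, the poppler_path idea above could be sketched roughly as follows. This is only a minimal sketch: find_poppler_dir and pdf_to_images are hypothetical helper names (not part of pdf2image), and the fallback directories are assumptions about a typical Linux node. It looks for the pdfinfo binary that the error message complains about and only calls pdf2image if it is found.

```python
import os
import shutil


def find_poppler_dir(extra_dirs=("/usr/bin", "/usr/local/bin")):
    """Return the directory containing the `pdfinfo` binary, or None."""
    # First check the PATH, which pdf2image relies on when poppler_path is unset.
    found = shutil.which("pdfinfo")
    if found:
        return os.path.dirname(found)
    # Fall back to a few likely locations (assumed; adjust for your cluster).
    for directory in extra_dirs:
        candidate = os.path.join(directory, "pdfinfo")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return directory
    return None


def pdf_to_images(pdf_path, dpi=200):
    """Convert a PDF to a list of PIL images, failing early if poppler is missing."""
    # Imported lazily so the lookup helper works even without pdf2image installed.
    from pdf2image import convert_from_path

    poppler_dir = find_poppler_dir()
    if poppler_dir is None:
        raise RuntimeError("pdfinfo not found: poppler is not installed on this node.")
    return convert_from_path(pdf_path, dpi=dpi, poppler_path=poppler_dir)
```

If find_poppler_dir() returns None, no value of poppler_path will help, because the poppler binaries themselves are missing from the node.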

dg · 10-20-2021

Thanks for the suggestion HubertDudek. Unfortunately after a few hours attempting to get this running with your path suggestion I've given up & moved the convert from pdf-->png to another part of the data pipeline.

wbrandler · 10-26-2022

I couldn't figure out how to get pdf2image working either. Instead I installed magick, converted the pdf to png, and then read it in with matplotlib

%sh
cd /opt
wget https://imagemagick.org/archive/binaries/magick
chmod 777 magick

%sh
/opt/magick convert $vcfFiles/$sample_id.pdf $vcfFiles/$sample_id.png

import os
import matplotlib.pyplot as plt
import matplotlib.image as img

# os.path.join gives a valid path whether or not vcfFiles ends in "/"
im = img.imread(os.path.join(vcfFiles, sample_id + ".png"))
plt.imshow(im)
display()

cheers

Slalom_Tobias · a month ago

Seems like this thread has died, but for posterity: Databricks provides the following code for installing poppler on a cluster. The code is sourced from the dbdemos accelerators, specifically the "LLM Chatbot With Retrieval Augmented Generation (RAG) and Llama 2 70B" (https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#) demo. In the 01-PDF-Advanced-Data-Preparation notebook there's code to remotely execute the 00-init-advanced notebook, and in that notebook you'll find the code below.

#install poppler on the cluster (should be done by init scripts)
def install_ocr_on_nodes():
    """
    install poppler on the cluster (should be done by init scripts)
    """
    # from pyspark.sql import SparkSession
    import subprocess
    num_workers = max(1,int(spark.conf.get("spark.databricks.clusterUsageTags.clusterWorkers")))
    command = "sudo rm -rf /var/cache/apt/archives/* /var/lib/apt/lists/* && sudo apt-get clean && sudo apt-get update && sudo apt-get install poppler-utils tesseract-ocr -y" 
    def run_subprocess(command):
        try:
            output = subprocess.check_output(command, stderr=subprocess.STDOUT, shell=True)
            return output.decode()
        except subprocess.CalledProcessError as e:
            raise Exception("An error occurred installing OCR libs:"+ e.output.decode())
    #install on the driver
    run_subprocess(command)
    def run_command(iterator):
        for x in iterator:
            yield run_subprocess(command)
    # spark = SparkSession.builder.getOrCreate()
    data = spark.sparkContext.parallelize(range(num_workers), num_workers) 
    # Use mapPartitions to run command in each partition (worker)
    output = data.mapPartitions(run_command)
    try:
        output.collect()
        print("OCR libraries installed")
    except Exception as e:
        print(f"Couldn't install on all nodes: {e}")
        raise
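After running the function above, one way to confirm that the binaries actually landed on the driver might look like the sketch below. tool_version is a hypothetical helper, not part of the dbdemos notebook. It only checks the node it runs on, so workers would need the same check run inside a Spark task.

```python
import subprocess


def tool_version(binary="pdftoppm"):
    """Return the banner printed by `<binary> -v`, or None if the binary is not on PATH."""
    try:
        proc = subprocess.run([binary, "-v"], capture_output=True, text=True)
    except (FileNotFoundError, PermissionError):
        return None
    # poppler tools print their version banner to stderr; fall back to stdout.
    return (proc.stderr or proc.stdout).strip()


# Report each tool the install command is expected to provide.
for name in ("pdfinfo", "pdftoppm", "tesseract"):
    version = tool_version(name)
    print(name, "->", "NOT FOUND" if version is None else version)
```

If pdfinfo and pdftoppm both report a version here, pdf2image should be able to find them on the driver without a poppler_path argument.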
