Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Trying to use pdf2image on databricks

dg
New Contributor II

Trying to use pdf2image on Databricks, but it's failing with "PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?"

I've installed pdf2image & poppler-utils by running the following in a cell:

%pip install pdf2image

%pip install poppler-utils

But still hitting this error.

Can anyone help with my next step please?

(Rather annoyingly, this question was asked before at https://forums.databricks.com/questions/62529/pdf-to-image-using-poppler.html, which I can see in Google search results, but the Databricks community forum seems to have changed format. If anyone knows how to translate the old forum URLs to the new ones, that would also be appreciated!)

Thanks

7 REPLIES

dg
New Contributor II

PS: I've also tried installing pdf2image & poppler-utils into the libraries on the cluster, but I'm still hitting the same issue.

Prabakar
Esteemed Contributor III

@Kaniz Fatma​ , would you be able to import the forum content here?

https://forums.databricks.com/questions/62529/pdf-to-image-using-poppler.html

Hubert-Dudek
Esteemed Contributor III

Try changing the poppler_path option. Try /usr/bin or /usr/local/bin (forward slashes, since the cluster runs Linux), or just a space " ". If that doesn't work, check the cluster's environment variables to see where poppler is installed. Example:

convert_from_path(file, 200, poppler_path='/usr/bin')
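
To check where (or whether) the poppler binaries are installed, one option is to look them up on PATH from a notebook cell. This is only a minimal sketch using the standard library; the helper name `locate_poppler` is made up for illustration, and it inspects only the machine the cell runs on (the driver), not the workers. pdf2image shells out to poppler command-line tools such as `pdfinfo` and `pdftoppm`, so those are the names looked up here.

```python
import shutil


def locate_poppler(binaries=("pdfinfo", "pdftoppm")):
    """Map each binary name to its full path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in binaries}


# Print what was found so the right directory can be passed as poppler_path
for name, path in locate_poppler().items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If a path such as /usr/bin/pdfinfo comes back, its directory is the value to try for poppler_path. If the tools are reported as not found, the binaries are not installed on that node, and no poppler_path setting will help until they are.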

dg
New Contributor II

Thanks for the suggestion, HubertDudek. Unfortunately, after a few hours attempting to get this running with your path suggestion, I've given up & moved the convert from pdf-->png to another part of the data pipeline.

wbrandler
New Contributor II

I couldn't figure out how to get pdf2image working either. Instead, I installed magick, converted the PDF to PNG, then read it in with matplotlib:

%sh
cd /opt
wget https://imagemagick.org/archive/binaries/magick
chmod 777 magick

%sh
/opt/magick convert $vcfFiles/$sample_id.pdf $vcfFiles/$sample_id.png

import matplotlib.pyplot as plt
import matplotlib.image as img

im = img.imread(vcfFiles + sample_id + ".png")
plt.imshow(im)
display()
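
The %sh cell relies on vcfFiles and sample_id being set as shell variables, while the Python cell uses Python variables with the same names. A possible alternative is to run the conversion from Python with subprocess, so only one set of variables is needed. This is a rough sketch, assuming the magick binary sits at /opt/magick as downloaded above; the function names and the /tmp/sample.pdf path are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, magick="/opt/magick"):
    """Return the magick command (as an argument list) and the output PNG path."""
    pdf_path = Path(pdf_path)
    png_path = pdf_path.with_suffix(".png")  # same folder and name, .png extension
    return [magick, "convert", str(pdf_path), str(png_path)], png_path


def pdf_to_png(pdf_path, magick="/opt/magick"):
    """Run the conversion and return the PNG path; raises if magick exits non-zero."""
    cmd, png_path = build_convert_command(pdf_path, magick)
    subprocess.run(cmd, check=True)
    return png_path


# Show the command that would be run for a hypothetical file
cmd, png_path = build_convert_command("/tmp/sample.pdf")
print(cmd)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.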

cheers

Slalom_Tobias
New Contributor III

Seems like this thread has died, but for posterity, Databricks provides the following code for installing poppler on a cluster. The code comes from the dbdemos accelerators, specifically the "LLM Chatbot With Retrieval Augmented Generation (RAG) and Llama 2 70B" demo (https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#). The 01-PDF-Advanced-Data-Preparation notebook contains code that remotely executes the 00-init-advanced notebook, and in that notebook you'll find the code below.

#install poppler on the cluster (should be done by init scripts)
def install_ocr_on_nodes():
    """
    install poppler on the cluster (should be done by init scripts)
    """
    # from pyspark.sql import SparkSession
    import subprocess
    num_workers = max(1,int(spark.conf.get("spark.databricks.clusterUsageTags.clusterWorkers")))
    command = "sudo rm -rf /var/cache/apt/archives/* /var/lib/apt/lists/* && sudo apt-get clean && sudo apt-get update && sudo apt-get install poppler-utils tesseract-ocr -y" 
    def run_subprocess(command):
        try:
            output = subprocess.check_output(command, stderr=subprocess.STDOUT, shell=True)
            return output.decode()
        except subprocess.CalledProcessError as e:
            raise Exception("An error occurred installing OCR libs: " + e.output.decode()) from e
    #install on the driver
    run_subprocess(command)
    def run_command(iterator):
        for x in iterator:
            yield run_subprocess(command)
    # spark = SparkSession.builder.getOrCreate()
    data = spark.sparkContext.parallelize(range(num_workers), num_workers) 
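    # Use mapPartitions to run the command once per partition. There is one partition
    # per worker, but Spark does not guarantee each partition lands on a different worker.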
    output = data.mapPartitions(run_command)
    try:
        output.collect()
        print("OCR libraries installed")
    except Exception as e:
        print(f"Couldn't install on all nodes: {e}")
        raise
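
Since the docstring says this should really be done by init scripts, here is a rough sketch of generating such a script from Python. Everything beyond the package names is an assumption: the file name install-poppler.sh and the helper `write_init_script` are made up, the demo writes to a temporary folder rather than a real location, and the script omits sudo on the assumption that cluster init scripts run as root. On Databricks the file would need to be stored somewhere the cluster can read and then referenced in the cluster's init script settings.

```python
import tempfile
from pathlib import Path

# Same packages as the apt-get command in the dbdemos code
INIT_SCRIPT = """#!/bin/bash
set -e
apt-get update
apt-get install -y poppler-utils tesseract-ocr
"""


def write_init_script(target_dir, filename="install-poppler.sh"):
    """Write the init script into target_dir and return its path."""
    target = Path(target_dir) / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(INIT_SCRIPT)
    return target


# Demo: write to a throwaway folder and show the result
script_path = write_init_script(tempfile.mkdtemp())
print(script_path.read_text())
```

An init script runs on every node as it starts, which avoids depending on how Spark schedules the mapPartitions tasks.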