Databricks Community

kbmv · ‎01-15-2025

Hi, I was following the Databricks tutorial at https://notebooks.databricks.com/demos/llm-rag-chatbot (the old version, which described how to install OCR dependencies on the nodes, i.e. install poppler on the cluster) to read PDF content.

I created the init script below to install poppler on my all-purpose cluster, and it works with no issues: I was able to use unstructured to read PDFs, even scanned ones.

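#!/bin/bash
# Clear stale apt package lists, then refresh the package index
sudo rm -r /var/lib/apt/lists/*
sudo apt-get clean && sudo apt-get update --fix-missing -y
# Install Poppler (provides pdfinfo) and Tesseract (OCR engine)
sudo apt-get install -y poppler-utils tesseract-ocr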

Using the above init script resolved the following error: "pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?"

I used the streaming library, with the above init script configured on all-purpose compute, to process my PDFs from a volume.

But I wanted to do this with DLT tables, since that makes things easier (no need to specify checkpoints, etc.). I created an ETL pipeline, since I can't use normal compute to run this, and the resulting job compute has the init script configured through a cluster policy. I checked the event logs: the init script ran with no errors, and the resource is up and running.

However, I am still getting the error "pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?", which the init script had resolved earlier. It works on all-purpose compute but not in the ETL pipeline, where DLT runs on job compute.

Can anyone help me understand what's different and what I can do to solve this?

Thanks

Alberto_Umana · ‎01-15-2025

Hi @kbmv,

For debugging, can you add this command to your DLT pipeline to confirm that the installation succeeded?

pdfinfo --version
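If running a shell command directly from the pipeline is not an option, the same check can be sketched in Python. This is only a minimal illustration: the helper name `check_binary` is hypothetical, and it uses the `-v` flag, which pdfinfo documents for printing version information.

```python
import shutil
import subprocess


def check_binary(binary="pdfinfo"):
    """Report whether `binary` is on PATH for the current Python process."""
    path = shutil.which(binary)
    if path is None:
        return {"found": False, "path": None, "version": None}
    # Capture both streams, since the version text may be written to stderr.
    result = subprocess.run([path, "-v"], capture_output=True, text=True)
    version = (result.stdout or result.stderr).strip()
    return {"found": True, "path": path, "version": version}


# Hypothetical usage: print the result so it shows up in the driver log.
print(check_binary("pdfinfo"))
```

Note that this only inspects the node where the code runs. If the PDF parsing happens inside a UDF, the same check would need to run there to cover the worker nodes as well.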

Alberto_Umana · ‎01-15-2025

Additionally, for the all-purpose installation, can you check the driver log (stdout)? It should show the output of the installation. Can you then do the same for the job cluster?

kbmv · ‎02-06-2025

Hi Alberto_Umana,

Thanks for looking into it. I got a solution from the Databricks support team assigned to my company.

The issue was with the cluster type (access mode), not with streaming or DLT. For streaming I was able to use Single User compute, but DLT doesn't let us configure the type of compute and uses shared-mode compute by default, so the same approach doesn't work there.

poppler-utils can't be used on a shared-mode cluster. More details are available at: https://community.databricks.com/t5/data-engineering/unable-to-install-poppler-utils/m-p/106570#M425...

Thanks