Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Init script works fine on All-purpose compute but has issues with Job compute created from a DLT ETL pipeline

kbmv
New Contributor II

Hi, I was following the Databricks tutorial at https://notebooks.databricks.com/demos/llm-rag-chatbot (the older version, which described how to install OCR tooling on the nodes, i.e. poppler on the cluster) to read PDF content.

I created the init script below to install poppler on my All-purpose cluster, and it works with no issues: I was able to use unstructured to read PDFs, even scanned ones.

sudo rm -r /var/lib/apt/lists/*
sudo apt clean && sudo apt update --fix-missing -y
sudo apt-get install poppler-utils tesseract-ocr -y

Using the above init script resolved the error "pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?".
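As a quick sanity check from a notebook (a minimal sketch, not part of the tutorial): pdf2image only needs poppler's pdfinfo binary to be discoverable on the PATH of the Python process, so you can verify the install like this:

```python
import shutil

def poppler_available() -> bool:
    # pdf2image shells out to poppler's pdfinfo; it must be on this
    # process's PATH, not just installed somewhere on the node.
    return shutil.which("pdfinfo") is not None

print("poppler on PATH:", poppler_available())
```

Running this on both cluster types shows immediately whether the difference is the installation itself or the PATH seen by the Python process.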

I used a streaming library together with the init script above, configured on All-purpose compute, to process my PDFs from a volume.

But I wanted to do it with DLT tables, since that makes things simpler (no need to manage checkpoints, etc.). I created an ETL pipeline, since I can't use normal compute to execute this, and the resulting Job compute has the init script configured via a cluster policy. I checked the event logs: the init script executed with no errors, and the resource is up and running.
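For reference, and if I read the pipeline settings schema correctly, DLT also lets you attach init scripts directly in the pipeline's cluster settings rather than going through a cluster policy (the volume path below is a placeholder for wherever the script is stored):

```json
{
  "clusters": [
    {
      "label": "default",
      "init_scripts": [
        {
          "volumes": {
            "destination": "/Volumes/main/default/scripts/install_poppler.sh"
          }
        }
      ]
    }
  ]
}
```

Configuring it this way makes the script visible in the pipeline definition itself, which can be easier to audit than a policy applied from outside.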

But I am still getting the error "pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?", which is the same error the init script resolves on All-purpose compute; it reappears only in the ETL pipeline when DLT runs on Job compute.

Can anyone help me understand what's different and how I can solve this?

Thanks

2 REPLIES

Alberto_Umana
Databricks Employee

Hi @kbmv,

For debugging, can you add this command to your DLT pipeline to confirm the installation succeeded?

pdfinfo --version
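One way to run that check from the pipeline's notebook (a sketch; adjust to your pipeline code) is to shell out from Python and capture the result instead of letting a missing binary crash the update:

```python
import subprocess
from typing import Optional

def pdfinfo_version() -> Optional[str]:
    # Returns poppler's version banner, or None if pdfinfo isn't on PATH.
    try:
        out = subprocess.run(
            ["pdfinfo", "-v"], capture_output=True, text=True, check=False
        )
        # pdfinfo prints its version banner to stderr, not stdout.
        return (out.stderr or out.stdout).strip() or None
    except FileNotFoundError:
        return None

print("pdfinfo:", pdfinfo_version())
```

If this prints None on the Job compute but a version string on the All-purpose cluster, the init script's installation isn't reaching (or isn't on the PATH of) the DLT workers.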

Alberto_Umana
Databricks Employee

Additionally, for the All-purpose installation, can you check the driver logs (the stdout log file)? It should show the output of the installation. Can you do the same for the job cluster?
