Hi, I was following the Databricks tutorial at https://notebooks.databricks.com/demos/llm-rag-chatbot (the older version, which included instructions for installing OCR dependencies on the nodes, i.e. installing poppler on the cluster) to read PDF content.
I created the init script below to install poppler on my all-purpose cluster, and it works with no issues: I was able to use unstructured to read PDFs, even scanned ones.
sudo rm -r /var/lib/apt/lists/*
sudo apt clean && sudo apt update --fix-missing -y
sudo apt-get install poppler-utils tesseract-ocr -y
Using the init script above resolved the following error: "pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?"
I used the streaming library, with the init script above configured on the all-purpose compute, to process my PDFs from a volume.
But I wanted to do this with DLT tables, since that makes things easier: there is no need to specify checkpoints, etc. I created an ETL pipeline, since I can't use normal compute to run this, and the resulting job compute has the init script configured through a cluster policy. I checked the event logs: the init script ran with no errors, and the resource is up and running.
However, I still get the error "pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?" This is the same error I previously resolved with the init script. The script works on all-purpose compute, but not in the ETL pipeline, where DLT runs on job compute.
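One check that might help narrow this down is to confirm whether the poppler and tesseract binaries are actually visible on PATH on the node where the PDF parsing runs. Below is a minimal sketch using only the Python standard library. The helper name report_pdf_tools is hypothetical (it is not part of pdf2image, unstructured, or Databricks), and the list of binaries is an assumption based on the tools pdf2image normally calls (pdfinfo, pdftoppm) plus tesseract.

```python
import os
import shutil
import socket


def report_pdf_tools(tools=("pdfinfo", "pdftoppm", "tesseract")):
    """Report which external binaries are visible on PATH on the current node.

    Hypothetical diagnostic helper: it only inspects the environment of the
    process that calls it and does not install or change anything.
    """
    return {
        # Hostname helps tell driver and worker nodes apart in the logs.
        "host": socket.gethostname(),
        # PATH as seen by this Python process.
        "path": os.environ.get("PATH", ""),
        # Full path of each binary, or None when it is not found on PATH.
        "found": {tool: shutil.which(tool) for tool in tools},
    }


if __name__ == "__main__":
    print(report_pdf_tools())
```

Calling this from the same place that parses the PDFs in the pipeline (for example, at the start of the parsing function) and logging the result should show whether pdfinfo is missing on that node or simply not on the PATH of that process.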
Can anyone help me understand what's different and what I can do to solve this?
Thanks