Init script works fine on All-Purpose compute but has issues on Job compute created by a DLT ETL pipeline
a week ago
Hi, I was following the Databricks tutorial at https://notebooks.databricks.com/demos/llm-rag-chatbot (the old version, which described how to install OCR on the nodes, i.e. install poppler on the cluster) to read PDF content.
I created the init script below to install poppler on my All-Purpose cluster, and it works with no issues; I was able to use unstructured to read the PDFs, even the scanned ones.
#!/bin/bash
# Clear stale apt metadata, refresh the package index, then install poppler and tesseract
sudo rm -rf /var/lib/apt/lists/*
sudo apt clean && sudo apt update --fix-missing -y
sudo apt-get install -y poppler-utils tesseract-ocr
Using the above init script resolved the following error: "pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?"
I used the streaming library, with the above init script configured on All-Purpose compute, to process my PDFs from a volume.
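For context, the parsing looks roughly like this (a minimal sketch, not the exact demo code; the volume path is a placeholder). partition_pdf hands pages to pdf2image under the hood, which is why poppler (pdfinfo) has to be present on the node:

from unstructured.partition.pdf import partition_pdf

# Placeholder path to a PDF stored in a Unity Catalog volume
pdf_path = "/Volumes/my_catalog/my_schema/my_volume/sample.pdf"

# The "hi_res" strategy converts pages to images via pdf2image (poppler) and
# uses tesseract for OCR; this call is what raises PDFInfoNotInstalledError
# when poppler is missing.
elements = partition_pdf(filename=pdf_path, strategy="hi_res")

print("\n".join(el.text for el in elements if el.text))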
But I wanted to do this with DLT tables, since it makes things easier (no need to manage checkpoints, etc.). I created an ETL pipeline, since I can't use normal compute to execute this, and the resulting Job compute has the init script configured via a cluster policy. I checked the event logs: the init script ran with no errors and the resource is up and running.
However, I am still getting the error "pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?", which the init script resolved on All-Purpose compute but not in the ETL pipeline where DLT runs on Job compute.
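For reference, the DLT side looks roughly like this (a simplified sketch with placeholder table and path names, not my exact pipeline). The parsing happens inside a UDF, so poppler and tesseract need to be available on the worker nodes of the Job compute, not just the driver:

import io
import dlt
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
from unstructured.partition.pdf import partition_pdf

@F.udf(returnType=StringType())
def extract_pdf_text(content):
    # Runs on the executors; this is where PDFInfoNotInstalledError is raised
    # if poppler is not installed there.
    elements = partition_pdf(file=io.BytesIO(content), strategy="hi_res")
    return "\n".join(el.text for el in elements if el.text)

@dlt.table(name="raw_pdf_text")
def raw_pdf_text():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .load("/Volumes/my_catalog/my_schema/my_volume/pdfs/")  # placeholder path
        .withColumn("text", extract_pdf_text(F.col("content")))
        .select("path", "text")
    )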
Can anyone help me understand what is different and what I can do to solve this?
Thanks
a week ago
Hi @kbmv,
For debugging, can you add this command to your DLT pipeline to confirm the installation succeeded?
pdfinfo --version
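For example, something like this at the top of the DLT notebook should surface the check in the driver output (a sketch using standard-library subprocess; adjust to where you want the result logged):

import subprocess

# Runs the same check the reply suggests; fails or prints an error if poppler
# is not on the PATH of the node executing this code.
result = subprocess.run(["pdfinfo", "--version"], capture_output=True, text=True)
print(result.stdout or result.stderr)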
a week ago
Additionally, for the All-Purpose installation, can you check the driver logs (the stdout log file)? It should show the output of the installation. Can you do the same for the Job cluster?
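As a further check (a sketch, assuming you can run it at the top of the pipeline notebook and read the output in the driver log, or on a cluster created from the same policy), you can compare whether pdfinfo is visible on the driver versus the workers, since the UDF that parses the PDFs runs on the workers:

import shutil

# Driver-side check
print("driver pdfinfo:", shutil.which("pdfinfo"))

# Worker-side check: run the same lookup inside a small Spark job
worker_paths = (
    spark.sparkContext
    .parallelize(range(4), 4)
    .map(lambda _: shutil.which("pdfinfo"))
    .collect()
)
print("worker pdfinfo:", worker_paths)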