Databricks Community

neha_ayodhya · ‎12-21-2023

I'm trying to extract text from an image file in a Databricks notebook. I installed the following libraries with pip: %pip install pytesseract tesseract pillow --upgrade

But it didn't work and threw the following error: pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.

I then installed the following libraries through the Libraries section of the cluster in Databricks:

pillow
pytesseract
tesseract

But this didn't work either.

Later I ran the following shell command in a Databricks notebook cell:

%sh

apt-get install -y tesseract-ocr

This command gave me the following error: E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied) E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?

Here is the code that I want to run in my Databricks notebook:

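from PIL import Image
import pytesseract

# img_path is the path to the image file; open it before converting
img = Image.open(img_path)
img_gray = img.convert('L')  # convert to grayscale
text = pytesseract.image_to_string(img_gray)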

I want the code to extract the text accurately from images. Please let me know where I am making a mistake.

shan_chandra · ‎01-11-2024

Hi @neha_ayodhya - can you please try the following via an init script on the Databricks cluster:

sudo apt-get update -y
sudo apt-get install -y tesseract-ocr
sudo apt-get install -y libtesseract-dev
/databricks/python/bin/pip install pytesseract

and let us know.

Thanks, Shan
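
Since pytesseract is only a Python wrapper that calls the separate tesseract executable, it can help to confirm that the executable is actually on the PATH once the cluster has restarted with the init script. The following is a minimal sketch of such a check, not part of the reply above; the helper name tesseract_status is hypothetical, and it assumes the apt package puts a binary named tesseract on the PATH.

```python
import shutil
import subprocess


def tesseract_status(binary="tesseract"):
    """Return (path, version_line) for the OCR binary, or (None, None) if it is not on PATH.

    `tesseract_status` is a hypothetical helper name used for illustration only.
    """
    path = shutil.which(binary)
    if path is None:
        return None, None
    # Some tesseract versions print the version to stderr instead of stdout.
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    lines = (result.stdout or result.stderr).splitlines()
    return path, (lines[0] if lines else "")


path, version = tesseract_status()
if path is None:
    print("tesseract binary not found on PATH; pytesseract would raise TesseractNotFoundError")
else:
    print(f"found {path}: {version}")
```

If the binary is reported as missing, the pip packages alone will probably not help, because the error comes from the missing system executable rather than from a missing Python library.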


pytesseract.pytesseract.TesseractNotFoundError in databricks notebook
