
Convert PDFs into structured data

User16826987838
Contributor

Is there anything on Databricks to help read PDFs (payment invoices and receipts, for example) and convert them to structured data?

2 REPLIES

Anonymous
Not applicable

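There are several open-source options, for example Tesseract (via pytesseract). It runs OCR on images, so PDF pages need to be rendered to images first:

import io
import pytesseract
from PIL import Image

def ocr_image(image_bytes):
    return pytesseract.image_to_string(Image.open(io.BytesIO(image_bytes)))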
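
The OCR call above only returns plain text, so getting to structured data still needs a parsing step. Here is a minimal sketch of how that might look, assuming `pdf2image` (which needs Poppler installed) to render PDF pages. The invoice layout, the labels ("Invoice No", "Date", "Total"), the regex patterns, and the helper names `pdf_to_text` and `parse_invoice_text` are all illustrative assumptions, not something from this thread:

```python
import re


def pdf_to_text(pdf_bytes):
    """OCR every page of a PDF and return the concatenated text."""
    # Imported lazily so the parsing helper below works without the OCR stack.
    import pytesseract
    from pdf2image import convert_from_bytes  # assumption: pdf2image + Poppler installed

    pages = convert_from_bytes(pdf_bytes)  # list of PIL images, one per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


# Hypothetical patterns for one invoice layout; adjust them to the real documents.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:Number|No\.?|#)\s*:?\s*(\S+)", re.I),
    "date": re.compile(r"\bDate\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
    "total": re.compile(r"\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.I),
}


def parse_invoice_text(text):
    """Extract the known fields from OCR text; fields not found become None."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    if record["total"] is not None:
        # Strip thousands separators before converting to a number.
        record["total"] = float(record["total"].replace(",", ""))
    return record
```

On Databricks, the resulting dicts could then be collected into a DataFrame (for example with `spark.createDataFrame`). Real invoices vary a lot in layout, so fixed regex patterns like these are only a starting point.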

SoniaFoster
New Contributor II

Thanks! Converting PDFs can be difficult, since not all converters are accurate. I recently came across an online tool, https://pdfflex.com/docx-to-pdf, that converts DOCX to PDF: you upload your DOCX file, it is converted in seconds, and you can then download the PDF.
