
Convert PDFs into structured data

User16826987838
Contributor

Is there anything on Databricks to help read PDFs (payment invoices and receipts, for example) and convert them to structured data?

2 REPLIES

Anonymous
Not applicable

There are several open source options, for example Tesseract:

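import io

import pytesseract
from PIL import Image

def ocr_image(image_bytes):
  # Decode the raw image bytes and run Tesseract OCR on the resulting image.
  return pytesseract.image_to_string(Image.open(io.BytesIO(image_bytes)))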
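
OCR only gets you raw text, so a second step is needed to turn it into structured fields. Below is a minimal sketch of that step using plain regular expressions. The function name `parse_invoice_text`, the field names, and the patterns are all hypothetical and assume a simple English invoice layout with ISO dates; real invoices vary a lot and would need their own patterns (or a dedicated document-parsing library).

```python
import re

# Hypothetical patterns for illustration only; tune them to your documents.
INVOICE_NO = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
)
DATE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
TOTAL = re.compile(r"\btotal\b\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE)


def parse_invoice_text(text: str) -> dict:
    """Pull a few fields out of OCR text; fields not found come back as None."""

    def first(pattern):
        match = pattern.search(text)
        return match.group(1) if match else None

    total = first(TOTAL)
    return {
        "invoice_number": first(INVOICE_NO),
        "date": first(DATE),
        # Strip thousands separators before converting to a number.
        "total": float(total.replace(",", "")) if total else None,
    }


# Stand-in for the string that ocr_image() would return for one page.
sample = (
    "ACME Corp\n"
    "Invoice No: INV-1042\n"
    "Date: 2023-05-17\n"
    "Subtotal: $1,100.00\n"
    "Total: $1,234.50\n"
)
result = parse_invoice_text(sample)
print(result)
```

Each returned dict maps naturally onto one row of a table, so the output of `ocr_image` can be passed straight into a parser like this before writing the results out.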

SoniaFoster
New Contributor II

Thanks! Converting PDFs can be difficult because not all converters are accurate. I recently came across an online tool, https://pdfflex.com/docx-to-pdf, that converts DOCX to PDF: you upload your DOCX file, it is converted in seconds, and you can then download the PDF.
