@kro You can try reading PDFs directly into a Spark DataFrame using the PDF DataSource. It extracts text from both digital and scanned PDFs.
df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition...
@PunithRaj You can try using the PDF DataSource for Apache Spark to read PDF files directly into a DataFrame. The output then contains the extracted text and each page rendered as an image. More details here: https://stabrise.com/spark-pdf/
df = spark.read.forma...
Spark PDF now works with Unity Catalog volumes, starting with version 0.1.16. More details here: https://stabrise.com/blog/spark-pdf-databricks-unity-catalog/
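For reference, Unity Catalog volumes are addressed with paths of the form /Volumes/<catalog>/<schema>/<volume>/..., so a small helper like the one below can build the path to pass to the reader. This is only a sketch: `volume_path` is a hypothetical helper, the catalog, schema, and volume names are placeholders, and the commented-out read assumes the data source accepts such a path in `.load(...)` on a cluster with Spark PDF installed.

```python
def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a Unity Catalog volume path: /Volumes/<catalog>/<schema>/<volume>/<parts...>."""
    # Hypothetical helper for illustration; not part of Spark PDF.
    for name in (catalog, schema, volume):
        if not name or "/" in name:
            raise ValueError(f"invalid volume path component: {name!r}")
    return "/".join(["/Volumes", catalog, schema, volume, *parts])


# Placeholder names; replace with your own catalog, schema, and volume.
path = volume_path("main", "docs", "pdfs", "invoices", "*.pdf")

# Assumption: an active SparkSession named `spark` with Spark PDF available.
# df = spark.read.format("pdf").load(path)
```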
You can use the PDF Data Source to read data from PDF files. Examples here: https://stabrise.com/blog/spark-pdf-on-databricks/
After that, use the Scale DP library to extract data from the text declaratively using an LLM. Here is an example of extraction ...