You can use the PDF Data Source to read data from PDF files. Examples here: https://stabrise.com/blog/spark-pdf-on-databricks/
After that, use the ScaleDP library to extract data from the text in a declarative way using an LLM. Here is an example of extracting data from scanned receipts:
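As a minimal sketch of the reading step: the snippet below assumes the Spark PDF package from the post above is already installed, and that it registers the short format name "pdf" and puts the recognized page text into a "text" column (the column name matches the inputCol used by the extractor below); the path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "pdf" is the format name used by the Spark PDF data source; install
# the package first (see the linked blog post for the coordinates).
input_df = spark.read.format("pdf").load("/path/to/receipts/*.pdf")

# One row per page, with the recognized text in the "text" column.
input_df.select("text").show(truncate=False)
```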
from datetime import date, time

from pydantic import BaseModel, Field

# CompanyType, Address and ReceiptItem are Pydantic models/enums
# defined elsewhere (not shown here).


class ReceiptSchema(BaseModel):
    """Receipt."""

    company_name: str
    shop_name: str
    company_type: CompanyType = Field(
        description="Type of the company.",
        examples=["MARKET", "PHARMACY"],
    )
    address: Address
    tax_id: str
    transaction_date: date = Field(description="Date of the transaction")
    transaction_time: time = Field(description="Time of the transaction")
    total_amount: float
    items: list[ReceiptItem]
extractor = LLMExtractor(
    model="gemini-1.5-flash",
    schema=ReceiptSchema,
    inputCol="text",
)
result_df = extractor.transform(input_df)
As a result you will get JSON like this:
{
  "company_name": "ROSHEN",
  "shop_name": "TOM HRA",
  "address": "м, Вінниця, вул, Келецька, /B B",
  "tax_id": "228739826104",
  "transaction_date": "23-10-2824 20:15:52",
  "total_amount": 328.06,
  "items": [
    {
      "name": "Шоколад чорний Brut 805",
      "quantity": 1.0,
      "price_per_unit": 46.31,
      "hko": "4823677632570",
      "price": 46.31
    },
    {
      "name": "Шоколад чорний Brut 80%",
      "quantity": 1.0,
      "price_per_unit": 46.31,
      "hko": "4893877632570",
      "price": 46.31
    },
    {
      "name": "Шоколад чорний Special",
      "quantity": 5.0,
      "price_per_unit": 33.84,
      "hko": "4803077632563",
      "price": 169.2
    },
    {
      "name": "Карамель LolliPops 3 ko",
      "quantity": 8.0,
      "price_per_unit": 18.51,
      "hko": "150",
      "price": 148.08
    },
    {
      "name": "Вафлі Wafers горіх 216r",
      "quantity": 1.0,
      "price_per_unit": 29.17,
      "hko": "4823877625626",
      "price": 29.17
    }
  ]
}
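Since LLM/OCR output is noisy (note the garbled date above), it is worth sanity-checking the extracted JSON before loading it downstream. A small sketch, using only the standard library and a two-item excerpt of the output above, that verifies each line item is internally consistent (quantity × unit price ≈ line total):

```python
import json

# Excerpt of the extractor output shown above (two items for brevity).
receipt_json = """
{
  "total_amount": 328.06,
  "items": [
    {"name": "LolliPops", "quantity": 8.0, "price_per_unit": 18.51, "price": 148.08},
    {"name": "Wafers", "quantity": 1.0, "price_per_unit": 29.17, "price": 29.17}
  ]
}
"""

receipt = json.loads(receipt_json)


def check_items(receipt: dict, tolerance: float = 0.01) -> list[str]:
    """Return the names of items whose line total does not match
    quantity * price_per_unit (up to a rounding tolerance)."""
    errors = []
    for item in receipt["items"]:
        expected = item["quantity"] * item["price_per_unit"]
        if abs(expected - item["price"]) > tolerance:
            errors.append(item["name"])
    return errors


print(check_items(receipt))  # [] means every line item adds up
```

The same idea extends to cross-checking the sum of line totals against total_amount, which is a quick way to flag receipts that need manual review.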
I'm developing document-processing tools using Spark.