06-20-2024 05:51 PM
Hello everyone,
I am developing an application that accepts PDF files and inserts their data into my database. The company that distributes this data to us only offers it as PDF files, an example of which is attached below (I hid personal info for privacy reasons). Do any of you know of a tool or approach that would make this doable?
Any advice or ideas would be appreciated.
Thank you!
06-21-2024 09:33 AM
Hello @Retired_mod, I really appreciate your help!
This makes a lot of sense.
12-23-2024 12:36 AM
I am not able to see the answer. Can you please share it?
07-12-2024 05:09 AM
Thank you so much for the help.
02-02-2025 09:33 AM
You can use PDF Data Source to read data from PDF files. Examples are here: https://stabrise.com/blog/spark-pdf-on-databricks/
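As a rough sketch of what that first step might look like in a notebook: the helper name read_pdf_text and the path in the comment are made up for illustration, and the short format name "pdf" is an assumption about how the data source is registered once the package is installed on the cluster, so please check the linked post for the exact options.

```python
def read_pdf_text(spark, path):
    """Load PDF pages into a Spark DataFrame.

    Assumptions (not confirmed by this thread): the PDF Data Source package
    is installed on the cluster and registers the format name "pdf".
    `spark` is the SparkSession that Databricks notebooks provide.
    """
    return spark.read.format("pdf").load(path)


# Hypothetical usage in a notebook (the path is a placeholder):
# input_df = read_pdf_text(spark, "/Volumes/my_catalog/my_schema/invoices/")
```

The extractor call further down reads from a column named "text", so the resulting DataFrame is expected to carry the page text in that column.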
After that, use the Scale DP library to extract data from the text declaratively with an LLM. Here is an example of extracting data from scanned receipts:
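from datetime import date, time

from pydantic import BaseModel, Field

# CompanyType, Address, ReceiptItem and LLMExtractor are assumed to be
# defined or imported elsewhere; the original post does not show them.


class ReceiptSchema(BaseModel):
    """Receipt."""

    company_name: str
    shop_name: str
    company_type: CompanyType = Field(
        description="Type of the company.",
        examples=["MARKET", "PHARMACY"],
    )
    address: Address
    tax_id: str
    transaction_date: date = Field(description="Date of the transaction")
    transaction_time: time = Field(description="Time of the transaction")
    total_amount: float
    items: list[ReceiptItem]


# Run the LLM over the "text" column of input_df, guided by the schema above.
extractor = LLMExtractor(model="gemini-1.5-flash", schema=ReceiptSchema, inputCol="text")
extractor.transform(input_df)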
As a result, you will get JSON like this:
{
"company_name": "ROSHEN",
"shop_name": "TOM HRA",
"address": "м, Вінниця, Вул, Келецька, /B B",
"tax_id": "228739826104",
"transaction_date": "23-10-2824 20:15:52",
"total_amount": 328.06,
"items": [
{
"name": "Шоколад чорний Brut 805",
"quantity": 1.0,
"price_per_unit": 46.31,
"hko": "4823677632570",
"price": 46.31
},
{
"name": "Шоколад чорний Brut 80%",
"quantity": 1.0,
"price_per_unit": 46.31,
"hko": "4893877632570",
"price": 46.31
},
{
"name": "Шоколад чорний Special",
"quantity": 5.0,
"price_per_unit": 33.84,
"hko": "4803077632563",
"price": 169.2
},
{
"name": "Карамель LolliPops 3 ko",
"quantity": 8.0,
"price_per_unit": 18.51,
"hko": "150",
"price": 148.08
},
{
"name": "ะะฐะปั Wafers ะณะพััั 216r",
"quantity": 1.0,
"price_per_unit": 29.17,
"hko": "4823877625626",
"price": 29.17
}
]
}
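Since the original question is about getting this data into a database, here is a minimal sketch of that last step using only the Python standard library. It assumes the extractor gives you one JSON string per receipt, shaped like the sample above. The SQLite tables and the helper names validate_receipt, store_receipt and process are made up for illustration and are not part of Scale DP. OCR and LLM output can be wrong, so the sketch checks each receipt and flags suspicious ones for review instead of trusting them blindly.

```python
import json
import sqlite3
from datetime import datetime


def validate_receipt(receipt: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    # Date format assumed from the sample output: "DD-MM-YYYY HH:MM:SS".
    try:
        when = datetime.strptime(receipt["transaction_date"], "%d-%m-%Y %H:%M:%S")
        if when > datetime.now():
            problems.append(f"transaction_date is in the future: {when.isoformat()}")
    except (KeyError, ValueError) as exc:
        problems.append(f"transaction_date is missing or unparseable: {exc}")
    items = receipt.get("items", [])
    # Each line total should equal quantity * price_per_unit.
    for item in items:
        expected = item["quantity"] * item["price_per_unit"]
        if abs(expected - item["price"]) > tolerance:
            problems.append(f"line total mismatch for {item['name']!r}")
    # The receipt total should equal the sum of the line totals.
    items_sum = sum(item["price"] for item in items)
    total = receipt.get("total_amount", 0.0)
    if abs(items_sum - total) > tolerance:
        problems.append(f"total_amount {total} != sum of items {items_sum:.2f}")
    return problems


def store_receipt(conn: sqlite3.Connection, receipt: dict, problems: list[str]) -> int:
    """Insert one receipt and its items; return the new receipt row id."""
    with conn:  # commits on success, rolls back if an insert fails
        conn.execute(
            "CREATE TABLE IF NOT EXISTS receipts ("
            "id INTEGER PRIMARY KEY, company_name TEXT, shop_name TEXT, tax_id TEXT, "
            "transaction_date TEXT, total_amount REAL, needs_review INTEGER)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS receipt_items ("
            "receipt_id INTEGER REFERENCES receipts(id), name TEXT, "
            "quantity REAL, price_per_unit REAL, price REAL)"
        )
        cur = conn.execute(
            "INSERT INTO receipts (company_name, shop_name, tax_id, "
            "transaction_date, total_amount, needs_review) VALUES (?, ?, ?, ?, ?, ?)",
            (
                receipt.get("company_name"),
                receipt.get("shop_name"),
                receipt.get("tax_id"),
                receipt.get("transaction_date"),
                receipt.get("total_amount"),
                int(bool(problems)),  # 1 if anything was flagged
            ),
        )
        receipt_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO receipt_items VALUES (?, ?, ?, ?, ?)",
            [
                (receipt_id, i.get("name"), i.get("quantity"),
                 i.get("price_per_unit"), i.get("price"))
                for i in receipt.get("items", [])
            ],
        )
    return receipt_id


def process(conn: sqlite3.Connection, json_text: str):
    """Parse one extractor result, validate it, and store it."""
    receipt = json.loads(json_text)
    problems = validate_receipt(receipt)
    return store_receipt(conn, receipt, problems), problems
```

The checks are worth having for this exact sample: the item prices add up to 439.07 while total_amount is 328.06, and the transaction date has the year 2824, so this receipt would be stored with needs_review set. For a real target database you would swap sqlite3 for your own driver or write the parsed rows out from Spark instead.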