Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Gathering Data Off Of A PDF File

trimethylpurine
New Contributor II

Hello everyone,

I am developing an application that accepts PDF files and inserts the data into my database. The company that distributes this data to us only offers PDF files, which you can see attached below (I hid personal info for privacy reasons). Do any of you know of a tool or approach that would make this doable?

Any advice or ideas would be appreciated.

Thank you!
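One common route is to extract the text layer from the PDF (for example with a library such as pypdf's `page.extract_text()`) and then parse it into records before inserting them into the database. The sketch below is hypothetical: the line format, field names, and regex are assumptions for illustration, not the actual layout of the vendor's PDFs.

```python
# Hypothetical sketch: parse text already extracted from a PDF into
# records ready for database insertion. The line layout and field names
# below are assumptions, not the vendor's real format.
import re

# Assumed record layout: 6-digit account, a name, then a decimal amount.
LINE_PATTERN = re.compile(
    r"(?P<account>\d{6})\s+(?P<name>[A-Za-z ]+?)\s+(?P<amount>\d+\.\d{2})$"
)

def parse_statement_text(text: str) -> list[dict]:
    """Turn matching lines into dicts; skip lines that don't match."""
    records = []
    for line in text.splitlines():
        m = LINE_PATTERN.search(line.strip())
        if m:
            rec = m.groupdict()
            rec["amount"] = float(rec["amount"])
            records.append(rec)
    return records

sample = """ACME DISTRIBUTION - MONTHLY REPORT
123456  John Smith  1500.00
654321  Jane Doe  240.50"""
print(parse_statement_text(sample))
```

If the PDFs are scanned images rather than text-based, plain text extraction will return nothing useful and you would need an OCR step first, as the replies below discuss.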

4 REPLIES

Hello @Retired_mod, I really appreciate your help!

This makes a lot of sense.

I'm not able to see the answer. Can you please share it?

NicholasGray
New Contributor II


Mykola_Melnyk
New Contributor III

You can use the PDF Data Source to read data from PDF files. Examples here: https://stabrise.com/blog/spark-pdf-on-databricks/

After that, you can use the ScaleDP library to extract data from the text in a declarative way using an LLM. Here is an example of extracting data from scanned receipts:

# Assumed imports: pydantic for the schema, datetime for the field types.
# CompanyType, Address, and ReceiptItem are schema classes defined elsewhere,
# as is LLMExtractor from the ScaleDP library.
from datetime import date, time
from pydantic import BaseModel, Field

class ReceiptSchema(BaseModel):
    """Receipt."""

    company_name: str
    shop_name: str
    company_type: CompanyType = Field(
        description="Type of the company.",
        examples=["MARKET", "PHARMACY"],
    )
    address: Address
    tax_id: str
    transaction_date: date = Field(description="Date of the transaction")
    transaction_time: time = Field(description="Time of the transaction")
    total_amount: float
    items: list[ReceiptItem]

extractor = LLMExtractor(model="gemini-1.5-flash", schema=ReceiptSchema, inputCol="text")
result_df = extractor.transform(input_df)

As a result, you will get JSON:

{
    "company_name": "ROSHEN",
    "shop_name": "TOM HRA",
    "address": "м, Вінниця, Вул, Келецька, /B B",
    "tax_id": "228739826104",
    "transaction_date": "23-10-2824 20:15:52",
    "total_amount": 328.06,
    "items": [
        {
            "name": "Шоколад чорний Brut 805",
            "quantity": 1.0,
            "price_per_unit": 46.31,
            "hko": "4823677632570",
            "price": 46.31
        },
        {
            "name": "Шоколад чорний Brut 80%",
            "quantity": 1.0,
            "price_per_unit": 46.31,
            "hko": "4893877632570",
            "price": 46.31
        },
        {
            "name": "Шоколад чорний Special",
            "quantity": 5.0,
            "price_per_unit": 33.84,
            "hko": "4803077632563",
            "price": 169.2
        },
        {
            "name": "Карамель LolliPops 3 ko",
            "quantity": 8.0,
            "price_per_unit": 18.51,
            "hko": "150",
            "price": 148.08
        },
        {
            "name": "Валі Wafers горіх 216r",
            "quantity": 1.0,
            "price_per_unit": 29.17,
            "hko": "4823877625626",
            "price": 29.17
        }
    ]
}
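Whichever extraction route you take, it is worth sanity-checking OCR/LLM output like this before inserting it into a database, since digits are easily garbled (note the impossible date above). A minimal, hypothetical sketch, not part of ScaleDP: verify each line item's quantity × unit price against its price, and the item sum against the grand total.

```python
# Hypothetical sketch: arithmetic sanity checks on extracted receipt JSON
# before loading it into a database. Field names follow the JSON above.
def receipt_is_consistent(receipt: dict, tol: float = 0.01) -> bool:
    """Return True if line items and the grand total reconcile within tol."""
    for item in receipt["items"]:
        if abs(item["quantity"] * item["price_per_unit"] - item["price"]) > tol:
            return False
    items_sum = sum(item["price"] for item in receipt["items"])
    return abs(items_sum - receipt["total_amount"]) <= tol

# Each line item checks out here, but the items sum to 215.51,
# not the stated 328.06 total, so the receipt is flagged.
receipt = {
    "total_amount": 328.06,
    "items": [
        {"quantity": 1.0, "price_per_unit": 46.31, "price": 46.31},
        {"quantity": 5.0, "price_per_unit": 33.84, "price": 169.2},
    ],
}
print(receipt_is_consistent(receipt))
```

Rows that fail such a check can be routed to a quarantine table for manual review instead of silently landing in production data.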

I'm developing document processing using Spark.
