12-20-2021 05:38 AM
Hello everyone,
I want to export my data from Databricks to Blob Storage. My notebook selects some PDFs from a blob container, runs Form Recognizer on them, and writes the results back to Blob Storage. Here is the code:
%pip install azure.storage.blob
%pip install azure.ai.formrecognizer

from azure.storage.blob import ContainerClient
container_url = "https://mystorageaccount.blob.core.windows.net/pdf-raw"
container = ContainerClient.from_container_url(container_url)
for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    print(blob_url)

import requests
from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential
endpoint = "https://myendpoint.cognitiveservices.azure.com/"
key = "mykeynumber"
form_recognizer_client = FormRecognizerClient(endpoint, credential=AzureKeyCredential(key))

import pandas as pd
field_list = ["InvoiceDate","InvoiceID","Items","VendorName"]
df = pd.DataFrame(columns=field_list)
for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
    invoices = poller.result()
    print("Scanning " + blob.name + "...")
    for idx, invoice in enumerate(invoices):
        single_df = pd.DataFrame(columns=field_list)
        for field in field_list:
            entry = invoice.fields.get(field)
            if entry:
                single_df[field] = [entry.value]
        single_df['FileName'] = blob.name
        df = df.append(single_df)
df = df.reset_index(drop=True)
df

account_name = "mystorageaccount"
account_key = "fs.azure.account.key." + account_name + ".blob.core.windows.net"
try:
    dbutils.fs.mount(
        source = "wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net",
        mount_point = "/mnt/pdf-recognized",
        extra_configs = {account_key: dbutils.secrets.get(scope ="formrec", key="formreckey")} )
except:
    print('Directory already mounted or error')

df.to_csv(r"/dbfs/mnt/pdf-recognized/output.csv", index=False)
The code works well until the very last line. I get the following error message: FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/mnt/pdf-recognized/output.csv'.
I tried using /dbfs:/ instead of /dbfs/ but I don't know what I am doing wrong.
How can I export my Databricks results to the blob?
Thank you
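A note on the error itself: pandas writes through ordinary local file I/O and does not create missing directories, so a FileNotFoundError on to_csv usually means the parent directory (here the mount point) is not there. A minimal sketch of a pre-flight check follows; the helper name check_parent_dir is hypothetical and not part of pandas or Databricks.

```python
import os


def check_parent_dir(path):
    """Return the parent directory of `path`, or raise a descriptive error.

    Hypothetical helper: it only checks the local file system view,
    which on Databricks is the /dbfs/... path that pandas writes to.
    """
    parent = os.path.dirname(path)
    if not os.path.isdir(parent):
        raise FileNotFoundError(
            f"Parent directory {parent!r} does not exist; "
            "if it is a mount point, the mount probably did not succeed."
        )
    return parent


# Possible use in the notebook, with the path from the question:
# check_parent_dir("/dbfs/mnt/pdf-recognized/output.csv")
# df.to_csv("/dbfs/mnt/pdf-recognized/output.csv", index=False)
```

Running the check right before the write points at the mount step rather than at to_csv when the directory is missing.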
12-20-2021 06:24 AM
Please verify that the directory exists:
dbutils.fs.ls("/dbfs/mnt/pdf-recognized")
12-21-2021 05:43 AM
Thank you for your reply Hubert.
When I run dbutils.fs.ls("/dbfs/mnt/pdf-recognized") I get an error message saying that the directory doesn't exist. I double-checked the spelling, and the container really is in that storage account. I don't know why I get that error.
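One detail about the check above: dbutils.fs commands take DBFS paths such as /mnt/pdf-recognized or dbfs:/mnt/pdf-recognized, while /dbfs/... is the local file view that pandas and open() use. Passing /dbfs/mnt/pdf-recognized to dbutils.fs.ls therefore looks under dbfs:/dbfs/mnt/pdf-recognized, which can fail even when a mount exists. The sketch below converts between the two forms; the helper names to_dbfs_uri and to_local_path are hypothetical, and they assume that any input starting with /dbfs/ is meant as a local path.

```python
def to_dbfs_uri(path):
    """Return the dbfs:/ form of a local /dbfs/... path or a bare DBFS path."""
    if path.startswith("dbfs:/"):
        return path
    # Assumption: a leading /dbfs means the local file view, so strip it.
    if path == "/dbfs" or path.startswith("/dbfs/"):
        path = path[len("/dbfs"):]
    return "dbfs:/" + path.lstrip("/")


def to_local_path(path):
    """Return the /dbfs/... local form of a dbfs:/ URI or a bare DBFS path."""
    if path == "/dbfs" or path.startswith("/dbfs/"):
        return path
    if path.startswith("dbfs:/"):
        path = path[len("dbfs:"):]
    return "/dbfs/" + path.lstrip("/")


# In a notebook, the directory check and the pandas write would then use:
# dbutils.fs.ls(to_dbfs_uri("/dbfs/mnt/pdf-recognized"))
# df.to_csv(to_local_path("dbfs:/mnt/pdf-recognized/output.csv"), index=False)
```

Listing dbutils.fs.mounts() is another way to see whether /mnt/pdf-recognized was mounted at all.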
01-07-2022 09:17 PM
Hi @Francis Bouliane, please try this instead:
%s/pdf-recognized/output.csv
01-09-2022 08:24 AM
Hi Kaniz, thank you for your response.
I tried %s/pdf-recognized/output.csv but I received the following error message:
UsageError: Line magic function `%s/pdf-recognized/output.csv` not found
Could you confirm if this would be the way to add this line:
account_name = "mystorageaccount"
account_key = "fs.azure.account.key." + account_name + ".blob.core.windows.net"
try:
    dbutils.fs.mount(
        source = "wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net",
        mount_point = "/mnt/pdf-recognized",
        extra_configs = {account_key: dbutils.secrets.get(scope ="formrec", key="formreckey")} )
except:
    print('Directory already mounted or error')
%s/pdf-recognized/output.csv
Thank you
01-21-2022 07:17 AM
Hi, I am new to Databricks, and this code was taken from a tutorial I found. The error happened because I had no secret scope set up in Databricks. Once I set up the secret scope, the code worked correctly.
Thank you everyone for your help!
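For anyone landing here with the same symptom: the bare except in the original code printed "Directory already mounted or error" for every failure, so a missing secret scope looked the same as an existing mount and the problem only surfaced later at to_csv. A minimal sketch of a mount step that lets real errors through is below. The function name mount_if_needed is hypothetical; it relies only on dbutils.fs.mounts() and dbutils.fs.mount(), and takes dbutils as an argument so it can be exercised outside a notebook.

```python
def mount_if_needed(dbutils, source, mount_point, extra_configs):
    """Mount `source` at `mount_point` unless that mount point already exists.

    Returns True if a new mount was created and False if it was already there.
    Hypothetical helper, not part of the Databricks API.
    """
    already_mounted = any(
        m.mountPoint == mount_point for m in dbutils.fs.mounts()
    )
    if already_mounted:
        return False
    # No try/except on purpose: a bad key or wrong container name should
    # raise here instead of being hidden behind a generic message.
    dbutils.fs.mount(
        source=source,
        mount_point=mount_point,
        extra_configs=extra_configs,
    )
    return True


# Possible use in the notebook, with the names from the question:
# account_key = "fs.azure.account.key.mystorageaccount.blob.core.windows.net"
# secret = dbutils.secrets.get(scope="formrec", key="formreckey")  # fails loudly if the scope is missing
# mount_if_needed(
#     dbutils,
#     "wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net",
#     "/mnt/pdf-recognized",
#     {account_key: secret},
# )
```

Fetching the secret on its own line, outside any try block, means a missing scope is reported at that line rather than masked by the mount step.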
01-22-2022 02:53 AM
Awesome!
01-21-2022 12:01 PM
@Francis Bouliane - Thank you for sharing the solution.