12-20-2021 05:38 AM
Hello everyone,
I want to export my data from Databricks to Blob Storage. My Databricks commands select some PDFs from my blob container, run Form Recognizer on them, and export the results back to Blob Storage. Here is the code:
%pip install azure.storage.blob
%pip install azure.ai.formrecognizer

from azure.storage.blob import ContainerClient

container_url = "https://mystorageaccount.blob.core.windows.net/pdf-raw"
container = ContainerClient.from_container_url(container_url)
for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    print(blob_url)

import requests
from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential

endpoint = "https://myendpoint.cognitiveservices.azure.com/"
key = "mykeynumber"
form_recognizer_client = FormRecognizerClient(endpoint, credential=AzureKeyCredential(key))

import pandas as pd

field_list = ["InvoiceDate","InvoiceID","Items","VendorName"]
df = pd.DataFrame(columns=field_list)
for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
    invoices = poller.result()
    print("Scanning " + blob.name + "...")
    for idx, invoice in enumerate(invoices):
        single_df = pd.DataFrame(columns=field_list)
        for field in field_list:
            entry = invoice.fields.get(field)
            if entry:
                single_df[field] = [entry.value]
        single_df['FileName'] = blob.name
        df = df.append(single_df)
df = df.reset_index(drop=True)
df

account_name = "mystorageaccount"
account_key = "fs.azure.account.key." + account_name + ".blob.core.windows.net"
try:
    dbutils.fs.mount(
        source = "wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net",
        mount_point = "/mnt/pdf-recognized",
        extra_configs = {account_key: dbutils.secrets.get(scope ="formrec", key="formreckey")} )
except:
    print('Directory already mounted or error')

df.to_csv(r"/dbfs/mnt/pdf-recognized/output.csv", index=False)
The code works well until the very last line, where I get the following error message: FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/mnt/pdf-recognized/output.csv'.
I tried using /dbfs:/ instead of /dbfs/, but I still don't know what I am doing wrong.
How can I export my Databricks results to Blob Storage?
Thank you
Labels: Azure, Blob, Form Recognizer, Mount Point
Accepted Solutions
01-21-2022 07:17 AM
Hi, I am new to Databricks, and this code was taken from a tutorial I found. The error happened because I had no secret scope set up in Databricks. Once I set up the secret scope, the code worked correctly.
Thank you everyone for your help!
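A likely explanation of why a missing secret scope showed up as a FileNotFoundError: dbutils.secrets.get raises when the scope does not exist, the bare except in the posted code swallows that error, the mount never happens, and /dbfs/mnt/pdf-recognized is therefore absent when to_csv runs. Below is a minimal sketch of a mount step that lets such errors surface. The helper name ensure_mounted is hypothetical, and dbutils is passed in explicitly only so the function can be exercised outside a notebook.

```python
def ensure_mounted(dbutils, source, mount_point, extra_configs):
    """Mount `source` at `mount_point` unless it is already mounted.

    Hypothetical helper (not part of Databricks). Errors from the mount call
    are deliberately not caught, so a bad key or source fails loudly here
    instead of later at write time.
    Returns True if a new mount was created, False if one already existed.
    """
    # dbutils.fs.mounts() lists current mounts; each entry exposes `mountPoint`.
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        return False
    dbutils.fs.mount(
        source=source,
        mount_point=mount_point,
        extra_configs=extra_configs,
    )
    return True
```

In a notebook, the call might look like ensure_mounted(dbutils, "wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net", "/mnt/pdf-recognized", {account_key: dbutils.secrets.get(scope="formrec", key="formreckey")}). Because the secret lookup is no longer inside a try/except, a missing scope stops the cell at that line.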
12-20-2021 06:24 AM
Please verify that the directory exists:
dbutils.fs.ls("/dbfs/mnt/pdf-recognized")
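A note on paths, offered as a sketch rather than a definitive rule set: dbutils.fs commands take DBFS paths such as /mnt/pdf-recognized or dbfs:/mnt/pdf-recognized, while local file APIs such as pandas' to_csv reach the same location through the /dbfs/... mount. So the listing call above would normally be written with /mnt/pdf-recognized, without the /dbfs prefix. The two helper names below are hypothetical and only illustrate the mapping between the two forms.

```python
def to_dbfs_path(local_path):
    """Convert a local '/dbfs/...' path into a 'dbfs:/...' path for dbutils.fs."""
    prefix = "/dbfs/"
    if not local_path.startswith(prefix):
        raise ValueError(f"not a /dbfs/ path: {local_path!r}")
    return "dbfs:/" + local_path[len(prefix):]


def to_local_path(dbfs_path):
    """Convert 'dbfs:/...' or a bare DBFS path like '/mnt/...' into '/dbfs/...'."""
    if dbfs_path.startswith("dbfs:/"):
        dbfs_path = dbfs_path[len("dbfs:"):]
    if not dbfs_path.startswith("/"):
        raise ValueError(f"not an absolute DBFS path: {dbfs_path!r}")
    return "/dbfs" + dbfs_path


# Path to pass to dbutils.fs.ls
print(to_dbfs_path("/dbfs/mnt/pdf-recognized"))
# Path to pass to df.to_csv
print(to_local_path("/mnt/pdf-recognized/output.csv"))
```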
12-21-2021 05:43 AM
Thank you for your reply, Hubert.
When I run dbutils.fs.ls("/dbfs/mnt/pdf-recognized"), I get an error message saying that the directory doesn't exist. I double-checked the spelling, and the container really is in that storage account. I don't know why it says that.
01-09-2022 08:24 AM
Hi Kaniz, thank you for your response.
I tried %s/pdf-recognized/output.csv but I received the following error message:
UsageError: Line magic function `%s/pdf-recognized/output.csv` not found
Could you confirm whether this is the right way to add this line:
account_name = "mystorageaccount"
account_key = "fs.azure.account.key." + account_name + ".blob.core.windows.net"
try:
    dbutils.fs.mount(
        source = "wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net",
        mount_point = "/mnt/pdf-recognized",
        extra_configs = {account_key: dbutils.secrets.get(scope ="formrec", key="formreckey")} )
except:
    print('Directory already mounted or error')
%s/pdf-recognized/output.csv
Thank you
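The UsageError occurs because a notebook line that begins with % is parsed as a magic command. The reply that suggested %s is not shown in this thread, so the following is only a guess at what was meant: %s may have been a Python string-format placeholder standing in for the mount root. A minimal sketch under that assumption, with a hypothetical variable name:

```python
# Assumption: %s was meant as a placeholder for the mount root,
# to be filled in with Python's % string formatting.
mount_root = "/dbfs/mnt"  # hypothetical variable, not from the thread
output_path = "%s/pdf-recognized/output.csv" % mount_root
print(output_path)
```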
01-21-2022 12:01 PM
@Francis Bouliane - Thank you for sharing the solution.