
Export Databricks results to Blob in a csv file

frank26364
New Contributor III

Hello everyone,

I want to export my data from Databricks to Blob storage. My Databricks notebook reads some PDFs from a blob container, runs Form Recognizer on them, and should export the results to another container. Here is the code:

    %pip install azure.storage.blob
    %pip install azure.ai.formrecognizer

    from azure.storage.blob import ContainerClient

    container_url = "https://mystorageaccount.blob.core.windows.net/pdf-raw"
    container = ContainerClient.from_container_url(container_url)

    # Print the URL of every blob in the input container
    for blob in container.list_blobs():
        blob_url = container_url + "/" + blob.name
        print(blob_url)
 
 
    from azure.ai.formrecognizer import FormRecognizerClient
    from azure.core.credentials import AzureKeyCredential

    endpoint = "https://myendpoint.cognitiveservices.azure.com/"
    key = "mykeynumber"

    form_recognizer_client = FormRecognizerClient(endpoint, credential=AzureKeyCredential(key))
 
   
    import pandas as pd

    field_list = ["InvoiceDate", "InvoiceID", "Items", "VendorName"]
    df = pd.DataFrame(columns=field_list)

    for blob in container.list_blobs():
        blob_url = container_url + "/" + blob.name
        poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
        invoices = poller.result()
        print("Scanning " + blob.name + "...")

        for idx, invoice in enumerate(invoices):
            single_df = pd.DataFrame(columns=field_list)

            # Copy each recognized field into a one-row DataFrame
            for field in field_list:
                entry = invoice.fields.get(field)

                if entry:
                    single_df[field] = [entry.value]

            # Tag the row with its source file and add it once per invoice
            single_df['FileName'] = blob.name
            df = pd.concat([df, single_df])  # DataFrame.append was removed in pandas 2.0

    df = df.reset_index(drop=True)
    df
    
 
    account_name = "mystorageaccount"
    account_key = "fs.azure.account.key." + account_name + ".blob.core.windows.net"
    
    try:
        dbutils.fs.mount(
            source = "wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net",
            mount_point = "/mnt/pdf-recognized",
            extra_configs = {account_key: dbutils.secrets.get(scope ="formrec", key="formreckey")} )
        
    except:
        print('Directory already mounted or error')
    
    df.to_csv(r"/dbfs/mnt/pdf-recognized/output.csv", index=False)

The code works well until the very last line. I get the following error message: FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/mnt/pdf-recognized/output.csv'.

I tried using /dbfs:/ instead of /dbfs/ but I don't know what I am doing wrong.

How can I export my Databricks results to the blob?

Thank you

1 ACCEPTED SOLUTION

Accepted Solutions

frank26364
New Contributor III

Hi, I am new to Databricks, and this code was taken from a tutorial I found. The error happened because I had no secret scope configured in Databricks. Once I set up the secret scope, the code worked correctly.

Thank you everyone for your help!
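
For anyone hitting the same error: the try/except in the question prints the same message whether the directory is already mounted or the mount fails for some other reason, such as a missing secret scope, so the real cause stays hidden. Below is a minimal sketch of one way to separate the two cases. `ensure_mounted` is a hypothetical helper, not part of `dbutils`; it takes `dbutils` as a parameter and assumes `dbutils.fs.mounts()` returns entries with a `mountPoint` attribute.

```python
def ensure_mounted(dbutils, source, mount_point, extra_configs):
    """Mount `source` at `mount_point` unless it is already mounted.

    Returns True if a new mount was created, False if the mount point
    already existed. Any other failure is allowed to propagate so the
    underlying error message stays visible.
    """
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        return False
    dbutils.fs.mount(
        source=source,
        mount_point=mount_point,
        extra_configs=extra_configs,
    )
    return True


# Possible notebook usage, with the values from the question:
#
# account_key = "fs.azure.account.key.mystorageaccount.blob.core.windows.net"
# ensure_mounted(
#     dbutils,
#     source="wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net",
#     mount_point="/mnt/pdf-recognized",
#     extra_configs={account_key: dbutils.secrets.get(scope="formrec", key="formreckey")},
# )
```

Because the secret lookup happens outside any try/except, a problem with the secret scope shows up as an exception instead of as "Directory already mounted or error".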


5 REPLIES

Hubert-Dudek
Esteemed Contributor III

Please verify that the directory exists:

dbutils.fs.ls("/dbfs/mnt/pdf-recognized")
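
A possible way to run a similar check from plain Python, assuming the cluster exposes DBFS on the driver under `/dbfs` (the path that pandas writes to in the question). `parent_dir_status` is a hypothetical helper name used only for this sketch.

```python
import os


def parent_dir_status(path):
    """Say whether the directory that should contain `path` exists locally."""
    parent = os.path.dirname(path)
    if os.path.isdir(parent):
        return "OK: " + parent + " exists"
    return "Missing: " + parent + " does not exist, so writing " + path + " would fail"


# Example with the path from the question (only meaningful on a Databricks driver):
# print(parent_dir_status("/dbfs/mnt/pdf-recognized/output.csv"))
```

If this reports the directory as missing, the mount step most likely did not succeed, which would explain the FileNotFoundError from `df.to_csv`.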

frank26364
New Contributor III

Thank you for your reply Hubert.

When I run dbutils.fs.ls("/dbfs/mnt/pdf-recognized") I get an error message saying that the directory doesn't exist. I double-checked the spelling, and the container really is in that storage account. I don't know why it tells me that.
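
One thing that may be worth checking here, stated as an assumption about standard Databricks behavior rather than something confirmed in this thread: `dbutils.fs` commands take DBFS paths such as `/mnt/pdf-recognized` or `dbfs:/mnt/pdf-recognized`, while local file APIs such as pandas use the same location under the `/dbfs/` prefix. The sketch below converts between the two forms; `local_to_dbfs` and `dbfs_to_local` are hypothetical helper names.

```python
def local_to_dbfs(local_path):
    """Convert a local path like /dbfs/mnt/x into the dbfs:/mnt/x form."""
    prefix = "/dbfs/"
    if not local_path.startswith(prefix):
        raise ValueError("expected a path starting with /dbfs/: " + local_path)
    return "dbfs:/" + local_path[len(prefix):]


def dbfs_to_local(dbfs_path):
    """Convert dbfs:/mnt/x (or /mnt/x) into the local /dbfs/mnt/x form."""
    if dbfs_path.startswith("dbfs:"):
        dbfs_path = dbfs_path[len("dbfs:"):]
    return "/dbfs/" + dbfs_path.lstrip("/")


print(local_to_dbfs("/dbfs/mnt/pdf-recognized/output.csv"))
print(dbfs_to_local("dbfs:/mnt/pdf-recognized"))
```

Under that assumption, the listing call would be `dbutils.fs.ls("/mnt/pdf-recognized")`, and `df.to_csv` would keep using `/dbfs/mnt/pdf-recognized/output.csv`. Neither works until the mount itself has succeeded.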

frank26364
New Contributor III

Hi Kaniz, thank you for your response.

I tried %s/pdf-recognized/output.csv but I received the following error message:

UsageError: Line magic function `%s/pdf-recognized/output.csv` not found

Could you confirm if this would be the way to add this line:

account_name = "mystorageaccount"
account_key = "fs.azure.account.key." + account_name + ".blob.core.windows.net"
    
try:
    dbutils.fs.mount(
        source = "wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net",
        mount_point = "/mnt/pdf-recognized",
        extra_configs = {account_key: dbutils.secrets.get(scope ="formrec", key="formreckey")} )
       
except:
    print('Directory already mounted or error')
%s/pdf-recognized/output.csv

Thank you

Anonymous
Not applicable

@Francis Bouliane - Thank you for sharing the solution.
