Data Engineering

Export Databricks results to Blob in a csv file

frank26364
New Contributor III

Hello everyone,

I want to export my data from Databricks to blob storage. My Databricks commands select some PDFs from my blob container, run Form Recognizer on them, and export the output results back to the blob. Here is the code:

%pip install azure-storage-blob
%pip install azure-ai-formrecognizer


from azure.storage.blob import ContainerClient

container_url = "https://mystorageaccount.blob.core.windows.net/pdf-raw"
container = ContainerClient.from_container_url(container_url)

# List every blob in the container and print its full URL
for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    print(blob_url)
 
 
import requests
from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential
 
endpoint = "https://myendpoint.cognitiveservices.azure.com/"
key = "mykeynumber"
 
form_recognizer_client = FormRecognizerClient(endpoint, credential=AzureKeyCredential(key))
 
   
import pandas as pd

field_list = ["InvoiceDate", "InvoiceID", "Items", "VendorName"]
df = pd.DataFrame(columns=field_list)

for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
    invoices = poller.result()
    print("Scanning " + blob.name + "...")

    for idx, invoice in enumerate(invoices):
        single_df = pd.DataFrame(columns=field_list)

        for field in field_list:
            entry = invoice.fields.get(field)

            if entry:
                single_df[field] = [entry.value]

        # Tag the row with its source file and collect it once per invoice,
        # not once per field
        single_df['FileName'] = blob.name
        df = pd.concat([df, single_df])

df = df.reset_index(drop=True)
df
    
 
account_name = "mystorageaccount"
account_key = "fs.azure.account.key." + account_name + ".blob.core.windows.net"

try:
    dbutils.fs.mount(
        source = "wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net",
        mount_point = "/mnt/pdf-recognized",
        extra_configs = {account_key: dbutils.secrets.get(scope="formrec", key="formreckey")})

except Exception as e:
    # Printing the exception makes mount failures visible instead of
    # silently swallowing them
    print("Directory already mounted or error: " + str(e))

df.to_csv(r"/dbfs/mnt/pdf-recognized/output.csv", index=False)

The code works well until the very last line. I get the following error message: FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/mnt/pdf-recognized/output.csv'.

I tried using /dbfs:/ instead of /dbfs/ but I don't know what I am doing wrong.
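As an aside on the path confusion: Databricks exposes DBFS in two forms. Spark and dbutils APIs take `dbfs:/...` URIs, while local-file libraries such as pandas need the FUSE mount prefix `/dbfs/...` (there is no `/dbfs:/` form). A minimal sketch of the mapping, using a hypothetical helper that is not part of any Databricks API:

```python
def to_local_path(dbfs_path: str) -> str:
    """Convert a dbfs:/ URI into the /dbfs/ FUSE path pandas can write to."""
    if dbfs_path.startswith("dbfs:/"):
        # Strip the URI scheme and re-root under the FUSE mount
        return "/dbfs/" + dbfs_path[len("dbfs:/"):].lstrip("/")
    if dbfs_path.startswith("/dbfs/"):
        # Already a local FUSE path
        return dbfs_path
    raise ValueError("expected a dbfs:/ or /dbfs/ path: " + dbfs_path)

print(to_local_path("dbfs:/mnt/pdf-recognized/output.csv"))
# /dbfs/mnt/pdf-recognized/output.csv
```

So `df.to_csv("/dbfs/mnt/pdf-recognized/output.csv")` is the right form, provided the mount itself succeeded.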

How can I export my Databricks results to the blob?

Thank you

1 ACCEPTED SOLUTION

Accepted Solutions

frank26364
New Contributor III

Hi, I am new to Databricks and this code was taken from a tutorial I found. The error happened because I had no secret scope mapped in Databricks. Once I set up the secret scope, the code worked correctly.
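For anyone hitting the same error: the scope referenced by dbutils.secrets.get(scope="formrec", key="formreckey") must exist before the mount can succeed. A minimal sketch using the legacy Databricks CLI, assuming the scope and key names from the code above:

```shell
# Create the secret scope the notebook expects (names taken from this thread)
databricks secrets create-scope --scope formrec

# Store the storage-account access key under the expected key name;
# the CLI prompts for the secret value
databricks secrets put --scope formrec --key formreckey
```

After this, dbutils.secrets.get(scope="formrec", key="formreckey") resolves and dbutils.fs.mount can authenticate against the storage account.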

Thank you everyone for your help!


7 REPLIES

Hubert-Dudek
Esteemed Contributor III

Please verify that the directory exists:

dbutils.fs.ls("/dbfs/mnt/pdf-recognized")

frank26364
New Contributor III

Thank you for your reply, Hubert.

When I run dbutils.fs.ls("/dbfs/mnt/pdf-recognized"), I get an error message saying the directory doesn't exist. I double-checked the spelling, and the container really is in that storage account. I don't know why it tells me that.

Kaniz
Community Manager

Hi @Francis Bouliane, please try this instead:

%s/pdf-recognized/output.csv

frank26364
New Contributor III

Hi Kaniz, thank you for your response.

I tried %s/pdf-recognized/output.csv but I received the following error message:

UsageError: Line magic function `%s/pdf-recognized/output.csv` not found

Could you confirm whether this is the way to add that line:

account_name = "mystorageaccount"
account_key = "fs.azure.account.key." + account_name + ".blob.core.windows.net"
    
try:
    dbutils.fs.mount(
        source = "wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net",
        mount_point = "/mnt/pdf-recognized",
        extra_configs = {account_key: dbutils.secrets.get(scope ="formrec", key="formreckey")} )
       
except:
    print('Directory already mounted or error')
%s/pdf-recognized/output.csv

Thank you


Kaniz
Community Manager

Awesome!

Anonymous
Not applicable

@Francis Bouliane - Thank you for sharing the solution.
