cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Accessing Excel file from Databricks

JohnJustus
New Contributor III

Hi,

I am trying to access excel file that is stored in Azure Blob storage via Databricks.

In my understanding, it is not possible to access using Pyspark. So accessing through Pandas is the option,

Here is my code.

%pip install openpyxl
import pandas as pd
from pandas import ExcelFile
pdf = pd.ExcelFile('/mnt/dl-sy-team-dev/AD/AbsentEmployees.xls/', engine='openpyxl')
print(pdf)

ERROR

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/dl-sy-team-dev/ADP/AbsentEmployees.xls/'

The file is definitely there and has proper permissions as I can browse using dbutils.fs.ls command in databricks.

Can some one please help me with this error as to how to fix this?

Thanks,

John

1 ACCEPTED SOLUTION

Accepted Solutions

Kaniz
Community Manager
Community Manager

Hi @JohnJustus , Unfortunately, Pandas does not directly support reading Excel files from Azure Blob Storage using the wasbs protocol.

 

  • Here are a couple of alternative approaches you can consider:

a. Use HTTPS with SAS Token: - Create a Shared Access Signature (SAS) token for your Blob storage container. - Access the Excel file using the HTTPS protocol with the SAS token.

 

b. Download as Stream and Read Locally: - Download the Excel file as a stream using the Azure Storage Python SDK. - Read the stream into a Pandas DataFrame. 

View solution in original post

3 REPLIES 3

Kaniz
Community Manager
Community Manager

Hi @JohnJustus , 

To resolve the FileNotFoundError when reading from Azure Blob Storage in Databricks, you need to use the "wasbs" protocol for the file path reference instead of the local file system path. Here's a summary of the steps to address this issue:

  1. Install the "openpyxl" library if it's not already installed.

  2. Define your Azure Blob Storage credentials, including the account name, container name, relative path to your Excel file, and the SAS token.

  3. Utilize the "wasbs" protocol in your file path reference to read Excel files directly from Azure Blob Storage using Pandas in Databricks.

By following these steps and filling in the appropriate values, you'll be able to read files from Azure Blob Storage without encountering a FileNotFoundError.

JohnJustus
New Contributor III

Thank you Kaniz for the response.

1. Completed
2. This is completed as part of our initial environment setup OR is this something needs to be defined before the code below?
3. import pandas as pd
xls = pd.ExcelFile('wasbs://mnt/dl-sync-team-dev/ADP/AbsentEmployees.xls/') -- Accessing the excel file in Blob container using wasbs
network_master = pd.read_excel(xls, 'Sheet1')

Error : ValueError: Protocol not known: wasbs
Can you please take a look and correct me if there is anything wrong with my code or is there anything we need to configure in the environment for wasbs protocol?

Thanks,
John

Kaniz
Community Manager
Community Manager

Hi @JohnJustus , Unfortunately, Pandas does not directly support reading Excel files from Azure Blob Storage using the wasbs protocol.

 

  • Here are a couple of alternative approaches you can consider:

a. Use HTTPS with SAS Token: - Create a Shared Access Signature (SAS) token for your Blob storage container. - Access the Excel file using the HTTPS protocol with the SAS token.

 

b. Download as Stream and Read Locally: - Download the Excel file as a stream using the Azure Storage Python SDK. - Read the stream into a Pandas DataFrame. 

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.