How to read data from Azure Storage

bchaubey
Contributor II

Hi Team,

May I know how to read Azure Storage data in Databricks through Python?

1 ACCEPTED SOLUTION

Kaniz
Community Manager
Hi @Bhagwan Chaubey​, once you've uploaded your files to your blob container:

Step 1: Get the credentials Databricks needs to connect to your blob container

In the Azure portal, navigate to All resources, select your blob storage account, and under Settings select Access keys. Copy the key under key1 to a local notepad.

Step 2: Configure Databricks to read the file

To start reading the data, first configure your Spark session to use the credentials for your blob container. This can be done through the spark.conf.set command.

storage_account_name = 'nameofyourstorageaccount'
storage_account_access_key = 'thekeyfortheblobcontainer'
spark.conf.set('fs.azure.account.key.' + storage_account_name + '.blob.core.windows.net', storage_account_access_key)

Once done, build the file path in the blob container and read the file as a Spark DataFrame.

blob_container = 'yourblobcontainername'
filePath = "wasbs://" + blob_container + "@" + storage_account_name + ".blob.core.windows.net/Sales/SalesFile.csv"
salesDf = spark.read.format("csv").load(filePath, inferSchema = True, header = True)

And congrats, we are done.

You can use the display command to take a sneak peek at your data.
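For example (a minimal sketch, reusing the salesDf DataFrame created above):

# Render the first rows as an interactive table in a notebook cell
display(salesDf)

# Plain-text alternative that also works outside notebooks
salesDf.show(5, truncate = False)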

Below is a snapshot of my code.

[Screenshot of the notebook code]



Kaniz
Community Manager

Hi @bchaubey! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer first; otherwise I will get back to you soon. Thanks.

Kaniz
Community Manager

Hi @Bhagwan Chaubey​, you can access your files using Python through the code below.

# Mount a Blob storage container (or a folder inside a container).
# <conf-key> is typically "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
# and the secret stored under <scope-name>/<key-name> holds the storage account access key.
dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"<conf-key>": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})
 
# Read the CSV data from the mount
df = spark.read.csv("dbfs:/mnt/%s/...." % <name-of-your-mount>)
display(df)

bchaubey
Contributor II

@Kaniz Fatma​, how can I find the value of mount_point = "/mnt/<mount-name>"?

Kaniz
Community Manager

Hi @Bhagwan Chaubey​ ,

<mount-name> is a DBFS path representing where the Blob storage container (or a folder inside the container, specified in source) will be mounted in DBFS.

Have you created any folders inside your blob container? If not, your file sits directly under the mount, so its path is simply "dbfs:/mnt/dataset.csv".

For example, if I want to read my country_classification.csv file, the path in my case is "dbfs:/mnt/country_classification.csv", as I have not created any folder or directory inside my blob container.

[Screenshots of the mounted file and of the notebook code]

Please do let me know if you have any more doubts.
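If you are not sure what is mounted where, a quick check (a minimal sketch; run it in a Python notebook cell) is to list the active mounts and the files under one of them:

# Print every active DBFS mount point and the source it maps to
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)

# List the files visible under a given mount point
display(dbutils.fs.ls("/mnt/<mount-name>"))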

bchaubey
Contributor II

%python
df = spark.read.csv("dbfs:/mnt/country_classification.csv")
display(df)

May I know how I can find dbfs:/mnt?

Kaniz
Community Manager

Hi @Bhagwan Chaubey​, can you please browse this path in Microsoft Azure: storage_account/containers/directory_in_which_you've_uploaded_your_dataset? That path itself will be your mount point.
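As a sketch of how that portal path lines up with the mount (all names here are hypothetical placeholders):

# Portal path:  mystorageaccount / containers / mycontainer / sales
source = "wasbs://mycontainer@mystorageaccount.blob.core.windows.net/sales"  # what you pass to dbutils.fs.mount
mount_point = "/mnt/sales"                                                   # where it appears in DBFS
file_path = "dbfs:/mnt/sales/SalesFile.csv"                                  # what you pass to spark.read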

bchaubey
Contributor II

Hi @Kaniz Fatma​, I am facing an issue while reading the data. Please see the attachment.

Kaniz
Community Manager

Hi @Bhagwan Chaubey​, can you please enter the correct scope and key names in the above code?
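If you want to verify the names before rerunning (a minimal sketch; "<scope-name>" is a placeholder), you can list what the workspace actually has:

# List all secret scopes visible to this workspace
print(dbutils.secrets.listScopes())

# List the key names inside one scope (the secret values themselves stay redacted)
print(dbutils.secrets.list("<scope-name>"))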

bchaubey
Contributor II

@Kaniz Fatma​, I have added the correct key.

Kaniz
Community Manager

Hi @Bhagwan Chaubey​, there might be a mismatch in the scope name or in the credentials, so you need to recheck all the values. I've also provided another way to solve your query in the accepted solution above. Please try it and let me know if it works.


Kaniz
Community Manager

Hi @Bhagwan Chaubey​, does this work for you? Were you able to execute the above commands and get the desired results? Please do let us know if you need help.

bchaubey
Contributor II

@Kaniz Fatma​, I am using your code and there is no error, but the data is still not showing.

Geoff123
New Contributor III

Kaniz,

I kept getting "org.apache.spark.SparkSecurityException: [INSUFFICIENT_PERMISSIONS] Insufficient privileges" with the same code. Do you know why?

Thanks!
