Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

RDD not picking up spark configuration for azure storage account access

Leo_138525
New Contributor II

I want to open some CSV files as an RDD, do some processing and then load it as a DataFrame. Since the files are stored in an Azure blob storage account I need to configure the access accordingly, which for some reason does not work when using an RDD. So I configure the access this way:

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

This works when loading the files directly as a DataFrame, but not when using the RDD API:

# This works with the previously set configuration
df = spark.read.format('csv').load('abfss://some/path/file.csv')
 
# This does not work and an error is thrown
rdd = spark.sparkContext.textFile('abfss://some/path/file.csv')
df = rdd.filter(filter_func).map(map_func).toDF()

The error I get is:

Failure to initialize configuration: Invalid configuration value detected for fs.azure.account.key

Why does the access configuration work when loading the files directly but not via an RDD? And how do I solve this problem?

1 ACCEPTED SOLUTION


data-guy
New Contributor III

Hello!

I got the same error a few days ago and resolved it with this post I found:

Accessing ADLS Gen 2 with RDD | Data Engineering (data-engineering.wiki)

Basically, the key is to set these properties on the Hadoop configuration using spark.sparkContext.hadoopConfiguration.set(...) rather than with spark.conf.set(...).
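
For example, a minimal PySpark sketch assuming the same service-principal placeholders as in the question (from Python, a common way to reach the Hadoop configuration is through the underlying Java context, spark.sparkContext._jsc):

# Set the ABFS OAuth properties on the Hadoop configuration, which the
# RDD code path reads; the spark.conf.set values above are not picked up there.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
hconf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
          "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
hconf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
hconf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
hconf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
          "https://login.microsoftonline.com/<directory-id>/oauth2/token")

# The RDD read should now authenticate against the storage account.
rdd = spark.sparkContext.textFile('abfss://some/path/file.csv')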

I hope you solve your problem!


4 REPLIES


JerryK
New Contributor II

Is there an explanation for why this behavior changed?

In the past on Azure Databricks, one could add a configuration parameter like

fs.azure.account.key.BLOB_CONTAINER_NAME.dfs.core.windows.net

with the value of a suitable ADLS Gen 2 account key to the Spark config in the Advanced options of a cluster's Configuration tab, and RDDs would just work without one having to call configuration-setting methods on the SparkContext in a job or notebook.
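
For reference, such an entry in the cluster's Spark config box is a space-separated key and value, one pair per line; note the property is keyed by the storage account name (the placeholders here are illustrative):

fs.azure.account.key.<storage-account>.dfs.core.windows.net <adls-gen2-account-key>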

shan_chandra
Esteemed Contributor

@Leo Baudrexel - could you please check whether the service principal has the correct permissions to access the storage account?

Please make sure the service principal has the "Contributor" or "Storage Blob Data Contributor" role on the storage account.

Leo_138525
New Contributor II

I decided to load the files into a DataFrame with a single column, do the processing, and then split it into separate columns, and this works just fine.

@Hyper Guy thanks for the link, I didn't try that but it seems like it would resolve the issue.
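
A minimal sketch of that workaround (the delimiter, column names, and the stand-in filter are assumptions; the real filter_func/map_func logic would operate on the single string column):

from pyspark.sql import functions as F

# Read each line as a single string column named 'value'; this goes through
# the DataFrame reader, so the spark.conf.set configuration applies.
raw = spark.read.text('abfss://some/path/file.csv')

# Stand-in for filter_func: keep non-empty lines.
kept = raw.filter(F.length(F.col('value')) > 0)

# Split into separate columns; assumes a comma-delimited file.
parts = F.split(F.col('value'), ',')
df = kept.select(
    parts.getItem(0).alias('col0'),
    parts.getItem(1).alias('col1'),
)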
