<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: RDD not picking up spark configuration for azure storage account access in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/rdd-not-picking-up-spark-configuration-for-azure-storage-account/m-p/31699#M23084</link>
    <description>&lt;P&gt;I decided to load the files into a DataFrame with a single column and then do the processing before splitting it into separate columns, and this works just fine.&lt;/P&gt;&lt;P&gt;@Hyper Guy​&amp;nbsp;thanks for the link; I didn't try it, but it seems like it would resolve the issue.&lt;/P&gt;</description>
    <pubDate>Wed, 28 Sep 2022 07:27:23 GMT</pubDate>
    <dc:creator>Leo_138525</dc:creator>
    <dc:date>2022-09-28T07:27:23Z</dc:date>
    <item>
      <title>RDD not picking up spark configuration for azure storage account access</title>
      <link>https://community.databricks.com/t5/data-engineering/rdd-not-picking-up-spark-configuration-for-azure-storage-account/m-p/31696#M23081</link>
      <description>&lt;P&gt;I want to open some CSV files as an RDD, do some processing and then load it as a DataFrame. Since the files are stored in an Azure blob storage account I need to configure the access accordingly, which for some reason does not work when using an RDD. So I configure the access this way:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.conf.set("fs.azure.account.auth.type.&amp;lt;storage-account&amp;gt;.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.&amp;lt;storage-account&amp;gt;.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.&amp;lt;storage-account&amp;gt;.dfs.core.windows.net", "&amp;lt;application-id&amp;gt;")
spark.conf.set("fs.azure.account.oauth2.client.secret.&amp;lt;storage-account&amp;gt;.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.&amp;lt;storage-account&amp;gt;.dfs.core.windows.net", "https://login.microsoftonline.com/&amp;lt;directory-id&amp;gt;/oauth2/token")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This works when loading the files directly as a DataFrame, but not when using the RDD API:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# This works with the previously set configuration
df = spark.read.format('csv').load('abfss://some/path/file.csv')
&amp;nbsp;
# This does not work and an error is thrown
rdd = spark.sparkContext.textFile('abfss://some/path/file.csv')
df = rdd.filter(filter_func).map(map_func).toDF()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The error I get is:&lt;/P&gt;&lt;P&gt;&lt;I&gt;Failure to initialize configuration: Invalid configuration value detected for fs.azure.account.key&lt;/I&gt;&lt;/P&gt;&lt;P&gt;Why does the access configuration work when loading the files directly but not via an RDD? And how do I solve this problem?&lt;/P&gt;</description>
      <pubDate>Thu, 15 Sep 2022 07:38:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/rdd-not-picking-up-spark-configuration-for-azure-storage-account/m-p/31696#M23081</guid>
      <dc:creator>Leo_138525</dc:creator>
      <dc:date>2022-09-15T07:38:20Z</dc:date>
    </item>
    <item>
      <title>Re: RDD not picking up spark configuration for azure storage account access</title>
      <link>https://community.databricks.com/t5/data-engineering/rdd-not-picking-up-spark-configuration-for-azure-storage-account/m-p/31697#M23082</link>
      <description>&lt;P&gt;Hello!&lt;/P&gt;&lt;P&gt;I got the same error a few days ago and resolved it with this post that I found:&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.data-engineering.wiki/docs/spark/accessing-adls-gen-2-with-rdd/" alt="https://www.data-engineering.wiki/docs/spark/accessing-adls-gen-2-with-rdd/" target="_blank"&gt;&lt;B&gt;Accessing ADLS Gen 2 with RDD | Data Engineering (data-engineering.wiki)&lt;/B&gt;&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Basically, the key is to set the properties on the Hadoop configuration using &lt;B&gt;spark.sparkContext.hadoopConfiguration.set(...)&lt;/B&gt;&lt;/P&gt;&lt;P&gt;I hope you solve your problem!&lt;/P&gt;</description>
      <pubDate>Thu, 22 Sep 2022 13:28:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/rdd-not-picking-up-spark-configuration-for-azure-storage-account/m-p/31697#M23082</guid>
      <dc:creator>data-guy</dc:creator>
      <dc:date>2022-09-22T13:28:36Z</dc:date>
    </item>
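The fix in the reply above is to set the access properties on the Hadoop configuration, which the RDD API reads, rather than only on the Spark session via `spark.conf.set`. A minimal PySpark sketch of that idea; the helper name is hypothetical, and the arguments stand in for the `<storage-account>`, `<application-id>`, and `<directory-id>` placeholders from the question:

```python
# Sketch of the fix from the reply above. This helper just builds the ABFS
# OAuth property names for one storage account; the values you pass in are
# placeholders, not real credentials.

def abfs_oauth_properties(storage_account, client_id, client_secret, directory_id):
    """Return the Hadoop properties for OAuth access to an ADLS Gen 2 account."""
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }

# The point of the linked post: set these on the SparkContext's Hadoop
# configuration, not only via spark.conf.set.
# In Scala:  spark.sparkContext.hadoopConfiguration.set(key, value)
# In PySpark the Hadoop configuration is reached through an internal handle:
#
# for key, value in abfs_oauth_properties(...).items():
#     spark.sparkContext._jsc.hadoopConfiguration().set(key, value)
```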
    <item>
      <title>Re: RDD not picking up spark configuration for azure storage account access</title>
      <link>https://community.databricks.com/t5/data-engineering/rdd-not-picking-up-spark-configuration-for-azure-storage-account/m-p/31698#M23083</link>
      <description>&lt;P&gt;@Leo Baudrexel​&amp;nbsp;- could you please check whether the service principal has the correct permissions to access the storage account?&lt;/P&gt;&lt;P&gt;Please make sure the service principal has the "Contributor" or "Storage Blob Data Contributor" role on the storage account.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Sep 2022 16:23:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/rdd-not-picking-up-spark-configuration-for-azure-storage-account/m-p/31698#M23083</guid>
      <dc:creator>shan_chandra</dc:creator>
      <dc:date>2022-09-27T16:23:12Z</dc:date>
    </item>
    <item>
      <title>Re: RDD not picking up spark configuration for azure storage account access</title>
      <link>https://community.databricks.com/t5/data-engineering/rdd-not-picking-up-spark-configuration-for-azure-storage-account/m-p/31699#M23084</link>
      <description>&lt;P&gt;I decided to load the files into a DataFrame with a single column and then do the processing before splitting it into separate columns, and this works just fine.&lt;/P&gt;&lt;P&gt;@Hyper Guy​&amp;nbsp;thanks for the link; I didn't try it, but it seems like it would resolve the issue.&lt;/P&gt;</description>
      <pubDate>Wed, 28 Sep 2022 07:27:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/rdd-not-picking-up-spark-configuration-for-azure-storage-account/m-p/31699#M23084</guid>
      <dc:creator>Leo_138525</dc:creator>
      <dc:date>2022-09-28T07:27:23Z</dc:date>
    </item>
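The workaround the author settled on, reading each line as a single string column and splitting it afterwards, could look roughly like the sketch below. It is untested against the thread's data; `split_line`, the comma delimiter, and the column names are hypothetical stand-ins:

```python
# Rough sketch of the single-column workaround described above: read raw
# lines via the DataFrame reader (which honors the spark.conf.set access
# settings), do the per-row processing, then split into columns.

def split_line(line, sep=","):
    """Per-row processing used in the sketch: split one raw line into fields."""
    return [field.strip() for field in line.split(sep)]

# In a Databricks notebook this could look like (untested sketch,
# hypothetical column names):
#
# from pyspark.sql import functions as F
#
# raw = spark.read.text("abfss://some/path/file.csv")  # one column: "value"
# parts = F.split(F.col("value"), ",")
# df = raw.select(
#     parts.getItem(0).alias("col_a"),
#     parts.getItem(1).alias("col_b"),
# )
```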
    <item>
      <title>Re: RDD not picking up spark configuration for azure storage account access</title>
      <link>https://community.databricks.com/t5/data-engineering/rdd-not-picking-up-spark-configuration-for-azure-storage-account/m-p/46331#M28045</link>
      <description>&lt;P&gt;Is there an &lt;EM&gt;explanation&lt;/EM&gt; for &lt;EM&gt;why&lt;/EM&gt; this behavior has changed?&lt;/P&gt;&lt;P&gt;In the past on Azure Databricks, one could add to the&amp;nbsp;&lt;EM&gt;Spark config&lt;/EM&gt; in the&amp;nbsp;&lt;EM&gt;Advanced options&lt;/EM&gt; of a cluster's&amp;nbsp;&lt;EM&gt;Configuration&lt;/EM&gt; tab a parameter like:&lt;/P&gt;&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN&gt;fs.azure.account.key.BLOB_CONTAINER_NAME.dfs.core.windows.net&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;with the value of a suitable ADLS Gen 2 account key, and RDDs would just work, without one having to call configuration-setting methods on the SparkContext of the Spark session in a job or notebook.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 26 Sep 2023 20:41:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/rdd-not-picking-up-spark-configuration-for-azure-storage-account/m-p/46331#M28045</guid>
      <dc:creator>JerryK</dc:creator>
      <dc:date>2023-09-26T20:41:50Z</dc:date>
    </item>
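On the question above: a commonly documented way to make such a key visible to the RDD API at the cluster level is to prefix the Hadoop property with `spark.hadoop.` in the cluster's Spark config, which copies it into the Hadoop configuration that RDDs read. A sketch of such an Advanced options entry, keeping the thread's placeholder names (not a verified fix for this specific report):

```text
# Cluster > Configuration > Advanced options > Spark config
# The spark.hadoop. prefix propagates the property into the Hadoop
# configuration used by the RDD API; ACCOUNT_KEY is a placeholder.
spark.hadoop.fs.azure.account.key.BLOB_CONTAINER_NAME.dfs.core.windows.net ACCOUNT_KEY
```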
  </channel>
</rss>

