Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

RDD not picking up spark configuration for azure storage account access

Leo_138525
New Contributor II

I want to open some CSV files as an RDD, do some processing and then load it as a DataFrame. Since the files are stored in an Azure blob storage account I need to configure the access accordingly, which for some reason does not work when using an RDD. So I configure the access this way:

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

This works when loading the files directly as a DataFrame, but not when using the RDD API:

# This works with the previously set configuration
df = spark.read.format('csv').load('abfss://some/path/file.csv')
 
# This does not work and an error is thrown
rdd = spark.sparkContext.textFile('abfss://some/path/file.csv')
df = rdd.filter(filter_func).map(map_func).toDF()

The error I get is:

Failure to initialize configuration: Invalid configuration value detected for fs.azure.account.key

Why does the access configuration work when loading the files directly but not via an RDD? And how do I solve this problem?

1 ACCEPTED SOLUTION


data-guy
New Contributor III

Hello!

I got the same error a few days ago and resolved it with this post I found:

Accessing ADLS Gen 2 with RDD | Data Engineering (data-engineering.wiki)

Basically, the key is to set these properties on the Hadoop configuration using spark.sparkContext.hadoopConfiguration.set(...) rather than with spark.conf.set(...).
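
For example, a minimal PySpark sketch assuming the same service-principal placeholders as in the question (from Python, a common way to reach the Hadoop configuration is through the underlying Java context, spark.sparkContext._jsc):

# Set the ABFS OAuth properties on the Hadoop configuration, which the
# RDD code path reads; the spark.conf.set values above are not picked up there.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
hconf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
          "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
hconf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
hconf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
hconf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
          "https://login.microsoftonline.com/<directory-id>/oauth2/token")

# The RDD read should now authenticate against the storage account.
rdd = spark.sparkContext.textFile('abfss://some/path/file.csv')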

I hope you solve your problem!


4 REPLIES


JerryK
New Contributor II

Is there an explanation for why this behavior changed?

In the past on Azure Databricks, one could add a configuration parameter like

fs.azure.account.key.BLOB_CONTAINER_NAME.dfs.core.windows.net

with the value of a suitable ADLS Gen 2 account key to the Spark config in the Advanced options of a cluster's Configuration tab, and RDDs would just work without one having to call configuration-setting methods on the SparkContext in a job or notebook.
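
For reference, such an entry in the cluster's Spark config box is a space-separated key and value, one pair per line; note the property is keyed by the storage account name (the placeholders here are illustrative):

fs.azure.account.key.<storage-account>.dfs.core.windows.net <adls-gen2-account-key>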

shan_chandra
Esteemed Contributor

@Leo Baudrexel - could you please check whether the service principal has the correct permissions to access the storage account?

Please make sure the service principal has the "Contributor" or "Storage Blob Data Contributor" role on the storage account.

Leo_138525
New Contributor II

I decided to load the files into a DataFrame with a single column, do the processing, and then split it into separate columns, and this works just fine.

@Hyper Guy thanks for the link, I didn't try that but it seems like it would resolve the issue.
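
A minimal sketch of that workaround (the delimiter, column names, and the stand-in filter are assumptions; the real filter_func/map_func logic would operate on the single string column):

from pyspark.sql import functions as F

# Read each line as a single string column named 'value'; this goes through
# the DataFrame reader, so the spark.conf.set configuration applies.
raw = spark.read.text('abfss://some/path/file.csv')

# Stand-in for filter_func: keep non-empty lines.
kept = raw.filter(F.length(F.col('value')) > 0)

# Split into separate columns; assumes a comma-delimited file.
parts = F.split(F.col('value'), ',')
df = kept.select(
    parts.getItem(0).alias('col0'),
    parts.getItem(1).alias('col1'),
)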
