RDD not picking up spark configuration for azure s...

Leo_138525 · ‎09-15-2022

I want to open some CSV files as an RDD, do some processing and then load it as a DataFrame. Since the files are stored in an Azure blob storage account I need to configure the access accordingly, which for some reason does not work when using an RDD. So I configure the access this way:

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

This works when loading the files directly as a DataFrame, but not when using the RDD API:

# This works with the previously set configuration
df = spark.read.format('csv').load('abfss://some/path/file.csv')
 
# This does not work and an error is thrown
rdd = spark.sparkContext.textFile('abfss://some/path/file.csv')
df = rdd.filter(filter_func).map(map_func).toDF()

The error I get is:

Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key

Why does the access configuration work when loading the files directly and not via a RDD? And how do I solve this problem?

RDD not picking up spark configuration for azure storage account access