09-15-2022 12:38 AM
I want to open some CSV files as an RDD, do some processing, and then load them as a DataFrame. Since the files are stored in an Azure Blob Storage account, I need to configure access accordingly, which for some reason does not work when using an RDD. I configure access this way:
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
This works when loading the files directly as a DataFrame, but not when using the RDD API:
# This works with the previously set configuration
df = spark.read.format('csv').load('abfss://some/path/file.csv')
# This does not work and an error is thrown
rdd = spark.sparkContext.textFile('abfss://some/path/file.csv')
df = rdd.filter(filter_func).map(map_func).toDF()
The error I get is:
Failure to initialize configuration: Invalid configuration value detected for fs.azure.account.key
Why does the access configuration work when loading the files directly and not via a RDD? And how do I solve this problem?
09-22-2022 06:28 AM
Hello!
I got the same error a few days ago and I resolved it with this post that I found:
Accessing ADLS Gen 2 with RDD | Data Engineering (data-engineering.wiki)
Basically, the key is to set the properties on the Hadoop configuration using spark.sparkContext.hadoopConfiguration.set(...), so they are visible to the RDD API as well.
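For the setup from the original post, a minimal PySpark sketch of that approach might look like the following. The placeholders (<storage-account>, <application-id>, <directory-id>) mirror the config above, and in Python the Hadoop configuration is reached through sparkContext._jsc, which is a private attribute, so treat this as an illustration rather than a documented API:
# Sketch: apply the same OAuth settings to the Hadoop configuration used by the RDD API
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
hconf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
hconf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
hconf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
hconf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
# After this, sparkContext.textFile should be able to read the abfss:// path
rdd = spark.sparkContext.textFile('abfss://some/path/file.csv')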
I hope you solve your problem!
09-26-2023 01:41 PM
Is there an explanation for why this behavior changed?
In the past on Azure Databricks, one could add a configuration parameter like
fs.azure.account.key.BLOB_CONTAINER_NAME.dfs.core.windows.net
with the value of a suitable ADLS Gen 2 account key to the Spark config in the Advanced options of a cluster's Configuration tab, and RDDs would just work, without having to call configuration-setting methods on the SparkContext associated with the Spark session in a job or notebook.
09-27-2022 09:23 AM
@Leo Baudrexel - could you please check whether the service principal has the correct permissions to access the storage account?
Please make sure the service principal has the "Contributor" or "Storage Blob Data Contributor" role on the storage account.
09-28-2022 12:27 AM
I decided to load the files into a DataFrame with a single column, do the processing there, and then split it into separate columns; this works just fine.
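A minimal sketch of that single-column workaround, assuming comma-separated files; the filter logic and column names here are illustrative, not the actual processing:
from pyspark.sql import functions as F
# Read each line as a single 'value' column; the DataFrame reader honours the
# spark.conf OAuth settings, unlike sparkContext.textFile
df = spark.read.format('text').load('abfss://some/path/file.csv')
# Illustrative processing: drop empty lines, then split on commas into columns
df = (df
      .filter(F.length('value') > 0)
      .withColumn('parts', F.split('value', ','))
      .select(F.col('parts')[0].alias('col0'),
              F.col('parts')[1].alias('col1')))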
@Hyper Guy thanks for the link, I didn't try that but it seems like it would resolve the issue.