09-15-2022 12:38 AM
I want to open some CSV files as an RDD, do some processing, and then load the result as a DataFrame. Since the files are stored in an Azure Blob Storage account, I need to configure access accordingly, which for some reason does not work when using an RDD. I configure access this way:
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
This works when loading the files directly as a DataFrame, but not when using the RDD API:
# This works with the previously set configuration
df = spark.read.format('csv').load('abfss://some/path/file.csv')
# This does not work and an error is thrown
rdd = spark.sparkContext.textFile('abfss://some/path/file.csv')
df = rdd.filter(filter_func).map(map_func).toDF()
The error I get is:
Failure to initialize configuration
Invalid configuration value detected for fs.azure.account.key
Why does the access configuration work when loading the files directly but not via an RDD? And how do I solve this problem?
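A possible explanation, offered as an assumption and not confirmed in this post: spark.conf.set writes to the session-level configuration that the DataFrame reader consults, while the RDD API reads the Hadoop configuration attached to the SparkContext. That would explain why the RDD path falls back to looking for fs.azure.account.key. The sketch below builds the same five settings shown above and copies them onto a Hadoop configuration object. The helper names abfs_oauth_settings and apply_to_hadoop_conf are hypothetical, and spark.sparkContext._jsc is a private PySpark attribute, so treat this as a workaround to verify rather than a supported API.

```python
from typing import Dict


def abfs_oauth_settings(storage_account: str, application_id: str,
                        directory_id: str, client_secret: str) -> Dict[str, str]:
    """Build the OAuth settings from the post for one storage account (hypothetical helper)."""
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": application_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


def apply_to_hadoop_conf(hadoop_conf, settings: Dict[str, str]) -> None:
    """Copy each setting onto any object exposing set(key, value) (hypothetical helper)."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)


# Intended use in a notebook where `spark` and `service_credential` already exist
# (assumption: the RDD API picks up the SparkContext's Hadoop configuration):
#
#   settings = abfs_oauth_settings("<storage-account>", "<application-id>",
#                                  "<directory-id>", service_credential)
#   apply_to_hadoop_conf(spark.sparkContext._jsc.hadoopConfiguration(), settings)
#   rdd = spark.sparkContext.textFile('abfss://some/path/file.csv')
```

An alternative that avoids the private attribute is to put the same keys into the cluster's Spark configuration with a spark.hadoop. prefix, for example spark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net. Credentials set that way are visible to every user of the cluster.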
Labels: Spark Configuration