Accessing ADLS Gen2 related Hadoop configuration in notebook
06-02-2024 07:03 AM - edited 06-02-2024 07:16 AM
I have a cluster configured with the settings required to access an ADLS Gen2 storage account, and it works without any problems.
I want to access this storage using the Hadoop filesystem APIs. To achieve this, I am trying to get the Hadoop configuration from the active Spark session in a notebook. Please see the code below.
from pyspark.sql import SparkSession
from py4j.java_gateway import java_import
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
java_import(spark._jvm, 'org.apache.hadoop.fs.FileSystem')
hadoop_conf = spark._jsc.hadoopConfiguration()
adl_uri = "abfss://adl_container@adl_storage_account.dfs.core.windows.net/"
dir_path = spark._jvm.Path(adl_uri)
fs = dir_path.getFileSystem(hadoop_conf)
files = fs.listStatus(dir_path)
print(f"Paths in directory {adl_uri}:")
for file in files:
    print(file.getPath())
When I run this code, I get the error below.
Py4JJavaError: An error occurred while calling o1131.getFileSystem. : Failure to initialize configuration for storage account adl_storage_account.dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.key
So, I checked the related configuration value, and it seems its value is None.
Given that the Hadoop configuration works as expected when Spark tries to access the storage, it is certain that the configuration is set properly in the cluster configuration. But why can't I access this configuration through "spark._jsc.hadoopConfiguration()"? Is there another way to access the hadoop configuration including the ADLS Gen2 connector configuration?
PS: I know I can use dbutils.fs library to interact with the storage. I have my reasons for not using it.
PS2: If I create a new Hadoop configuration instance and set the required configuration on it, the above code works as expected. But I do not want to create a new Hadoop configuration instance when I already have one with the required authentication info.
java_import(spark._jvm, 'org.apache.hadoop.conf.Configuration')
hadoop_conf = spark._jvm.Configuration()
storage_account_name = "adl_storage_account"
hadoop_conf.set(
    "fs.azure.account.auth.type.{0}.dfs.core.windows.net".format(storage_account_name),
    "OAuth"
)
hadoop_conf.set(
    "fs.azure.account.oauth.provider.type.{0}.dfs.core.windows.net".format(storage_account_name),
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
)
hadoop_conf.set(
    "fs.azure.account.oauth2.client.id.{0}.dfs.core.windows.net".format(storage_account_name),
    dbutils.secrets.get("my_secret_scope", "client_id")
)
hadoop_conf.set(
    "fs.azure.account.oauth2.client.secret.{0}.dfs.core.windows.net".format(storage_account_name),
    dbutils.secrets.get("my_secret_scope", "client_secret")
)
hadoop_conf.set(
    "fs.azure.account.oauth2.client.endpoint.{0}.dfs.core.windows.net".format(storage_account_name),
    dbutils.secrets.get("my_secret_scope", "client_endpoint")
)
hadoop_conf.set(
    "fs.azure.account.key.{0}.dfs.core.windows.net".format(storage_account_name),
    dbutils.secrets.get("my_secret_scope", "key")
)
06-06-2024 10:10 AM - edited 06-06-2024 10:27 AM
Hello @Retired_mod,
Thank you, `spark.sparkContext.getConf().getAll()` gets the configuration that I need. But it also includes configuration entries that are not related to Hadoop. I assume, then, that there is no better way to get the complete Hadoop configuration.
The `Path.getFileSystem(hadoop_conf)` method requires a `spark._jvm.Configuration` instance, so I will have to create a new instance of `spark._jvm.Configuration()`, manually filter the desired entries from the entire Spark configuration, fill the `Configuration` instance with the filtered entries, and pass it to the method. But at least I have a working method now 🙂
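A minimal sketch of that filtering step, under a few assumptions: the helper names `extract_hadoop_entries` and `build_hadoop_conf` are made up for illustration, the relevant keys are assumed to start with `fs.azure.`, and some of them may appear with Spark's `spark.hadoop.` prefix, which is stripped before they are copied. How the keys actually look depends on how the cluster was configured, so the prefixes may need adjusting.

```python
SPARK_HADOOP_PREFIX = "spark.hadoop."


def extract_hadoop_entries(spark_conf_pairs, prefixes=("fs.azure.",)):
    """Pick the Hadoop-related entries out of (key, value) pairs.

    `spark_conf_pairs` is expected to look like the result of
    spark.sparkContext.getConf().getAll(), i.e. a list of (key, value) tuples.
    Keys carrying the "spark.hadoop." prefix are normalised by removing it.
    """
    entries = {}
    for key, value in spark_conf_pairs:
        if key.startswith(SPARK_HADOOP_PREFIX):
            key = key[len(SPARK_HADOOP_PREFIX):]
        if key.startswith(tuple(prefixes)):
            entries[key] = value
    return entries


def build_hadoop_conf(spark, prefixes=("fs.azure.",)):
    """Create a new Hadoop Configuration filled from the Spark configuration.

    Hypothetical helper: it needs a live Spark session, so it is only
    defined here and not called.
    """
    pairs = spark.sparkContext.getConf().getAll()
    hadoop_conf = spark._jvm.org.apache.hadoop.conf.Configuration()
    for key, value in extract_hadoop_entries(pairs, prefixes).items():
        hadoop_conf.set(key, value)
    return hadoop_conf
```

In a notebook, `hadoop_conf = build_hadoop_conf(spark)` could then be passed to `dir_path.getFileSystem(hadoop_conf)` as in the original code.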
Thank you very much for your help!
06-06-2024 10:28 AM
By the way, how do you achieve inline code highlighting in the editor? 🙂 I tried `` but it didn't work.