I have a cluster configured with everything required to access an ADLS Gen2 storage account, and that access works without any problems.
I now want to work with this storage through the Hadoop filesystem APIs. To do that, I am trying to get the Hadoop configuration from the active Spark session in a notebook. Please see the code below.
from pyspark.sql import SparkSession
from py4j.java_gateway import java_import

# Make the Hadoop filesystem classes available in the JVM view
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
java_import(spark._jvm, 'org.apache.hadoop.fs.FileSystem')

# Hadoop configuration taken from the active Spark session
hadoop_conf = spark._jsc.hadoopConfiguration()

adl_uri = "abfss://adl_container@adl_storage_account.dfs.core.windows.net/"
dir_path = spark._jvm.Path(adl_uri)
fs = dir_path.getFileSystem(hadoop_conf)

files = fs.listStatus(dir_path)
print(f"Paths in directory {adl_uri}:")
for file in files:
    print(file.getPath())
When I run this code, I get the error below.
Py4JJavaError: An error occurred while calling o1131.getFileSystem. : Failure to initialize configuration for storage account adl_storage_account.dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.key
So I checked the related configuration value, and it turns out to be None.
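For reference, this is roughly how I checked it (the account-specific key name below is my guess based on the error message):

key_name = "fs.azure.account.key.adl_storage_account.dfs.core.windows.net"
print(hadoop_conf.get(key_name))  # prints None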
Since access to the storage works when Spark itself reads from it, the configuration must be set properly at the cluster level. So why can't I see it through "spark._jsc.hadoopConfiguration()"? Is there another way to access the Hadoop configuration, including the ADLS Gen2 connector settings?
PS: I know I can use the dbutils.fs utilities to interact with the storage. I have my reasons for not using them.
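For completeness, the alternative I am avoiding would look something like this:

files = dbutils.fs.ls(adl_uri)
for file in files:
    print(file.path)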
PS2: If I create a new Hadoop configuration instance and set the required configuration on it, the code above works as expected. But I do not want to create a new Hadoop configuration instance when I already have one with the required authentication info.
java_import(spark._jvm, 'org.apache.hadoop.conf.Configuration')

# Fresh Hadoop configuration, populated manually with the OAuth settings
hadoop_conf = spark._jvm.Configuration()
storage_account_name = "adl_storage_account"

hadoop_conf.set(
    "fs.azure.account.auth.type.{0}.dfs.core.windows.net".format(storage_account_name),
    "OAuth"
)
hadoop_conf.set(
    "fs.azure.account.oauth.provider.type.{0}.dfs.core.windows.net".format(storage_account_name),
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
)
hadoop_conf.set(
    "fs.azure.account.oauth2.client.id.{0}.dfs.core.windows.net".format(storage_account_name),
    dbutils.secrets.get("my_secret_scope", "client_id")
)
hadoop_conf.set(
    "fs.azure.account.oauth2.client.secret.{0}.dfs.core.windows.net".format(storage_account_name),
    dbutils.secrets.get("my_secret_scope", "client_secret")
)
hadoop_conf.set(
    "fs.azure.account.oauth2.client.endpoint.{0}.dfs.core.windows.net".format(storage_account_name),
    dbutils.secrets.get("my_secret_scope", "client_endpoint")
)
hadoop_conf.set(
    "fs.azure.account.key.{0}.dfs.core.windows.net".format(storage_account_name),
    dbutils.secrets.get("my_secret_scope", "key")
)