06-02-2024 07:03 AM - edited 06-02-2024 07:16 AM
I have a cluster in which I have the required configuration to access an ADLS Gen2, and it works without any problems.
I want to access this storage using the Hadoop filesystem APIs. To achieve this, I am trying to get the Hadoop configuration from the active Spark session in a notebook. Please see the code below.
from pyspark.sql import SparkSession
from py4j.java_gateway import java_import

# Import the Hadoop filesystem classes into the Py4J gateway
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
java_import(spark._jvm, 'org.apache.hadoop.fs.FileSystem')

# Reuse the Hadoop configuration of the active Spark session
hadoop_conf = spark._jsc.hadoopConfiguration()

adl_uri = "abfss://adl_container@adl_storage_account.dfs.core.windows.net/"
dir_path = spark._jvm.Path(adl_uri)

# Resolve the filesystem for the ABFSS path and list its contents
fs = dir_path.getFileSystem(hadoop_conf)
files = fs.listStatus(dir_path)

print(f"Paths in directory {adl_uri}:")
for file in files:
    print(file.getPath())
When I run this code, I get the error below.
Py4JJavaError: An error occurred while calling o1131.getFileSystem. : Failure to initialize configuration for storage account adl_storage_account.dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.key
So, I checked the related configuration value, and it seems its value is None.
Given that the Hadoop configuration works as expected when Spark itself accesses the storage, the configuration is certainly set properly on the cluster. But why can't I access it through `spark._jsc.hadoopConfiguration()`? Is there another way to access the Hadoop configuration, including the ADLS Gen2 connector configuration?
PS: I know I can use the `dbutils.fs` library to interact with the storage. I have my reasons for not using it.
PS2: If I create a new Hadoop configuration instance and set the required configuration on it, the code above works as expected. But I do not want to create a new Hadoop configuration instance when I already have one with the required authentication info.
java_import(spark._jvm, 'org.apache.hadoop.conf.Configuration')

# Build a fresh Hadoop configuration and set the ADLS Gen2 (ABFS) auth properties on it
hadoop_conf = spark._jvm.Configuration()
storage_account_name = "adl_storage_account"

hadoop_conf.set(
    "fs.azure.account.auth.type.{0}.dfs.core.windows.net".format(storage_account_name),
    "OAuth"
)
hadoop_conf.set(
    "fs.azure.account.oauth.provider.type.{0}.dfs.core.windows.net".format(storage_account_name),
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
)
hadoop_conf.set(
    "fs.azure.account.oauth2.client.id.{0}.dfs.core.windows.net".format(storage_account_name),
    dbutils.secrets.get("my_secret_scope", "client_id")
)
hadoop_conf.set(
    "fs.azure.account.oauth2.client.secret.{0}.dfs.core.windows.net".format(storage_account_name),
    dbutils.secrets.get("my_secret_scope", "client_secret")
)
hadoop_conf.set(
    "fs.azure.account.oauth2.client.endpoint.{0}.dfs.core.windows.net".format(storage_account_name),
    dbutils.secrets.get("my_secret_scope", "client_endpoint")
)
hadoop_conf.set(
    "fs.azure.account.key.{0}.dfs.core.windows.net".format(storage_account_name),
    dbutils.secrets.get("my_secret_scope", "key")
)
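For reference, the manually built configuration is then used exactly like the session configuration in my first snippet; a minimal sketch with the same container URI:

# Resolve the filesystem with the manually populated configuration and list the directory
dir_path = spark._jvm.Path(adl_uri)  # adl_uri as defined in the first snippet
fs = dir_path.getFileSystem(hadoop_conf)
for file in fs.listStatus(dir_path):
    print(file.getPath())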
06-03-2024 03:30 AM - edited 06-03-2024 03:31 AM
Hi @Ender,
First, let’s address why you’re getting the error related to the `fs.azure.account.key` configuration value. The error message indicates that the configuration value for your ADLS Gen2 storage account is invalid. However, you’ve confirmed that the configuration works as expected when Spark accesses the storage. So, why can’t you access this configuration through `spark._jsc.hadoopConfiguration()`?

The reason lies in how Spark manages its configuration. When you use `spark._jsc.hadoopConfiguration()`, it provides access to the Hadoop configuration settings specific to Spark. However, it doesn’t include all the configuration values that you might expect, especially those related to external services like ADLS Gen2.

To access the complete Hadoop configuration, including the ADLS Gen2 connector configuration, you’ll need to take a different approach. Fortunately, there are ways to achieve this:
1. Using `spark.sparkContext.getConf().getAll()`: You can retrieve all the Spark configuration settings that have been set for the session using the following code snippet:

spark_conf = spark.sparkContext.getConf().getAll()

This will give you a list of key-value pairs with all the configured settings, including those related to Hadoop. Keep in mind that only values explicitly specified through `spark-defaults.conf`, `SparkConf`, or the command line will appear.
2. Setting Hadoop Configuration Properties: If you specifically want to set Hadoop configuration properties for your Spark session, you can set them on the session's Hadoop configuration (in PySpark this is reached through `spark.sparkContext._jsc.hadoopConfiguration()`). For example:

spark.sparkContext._jsc.hadoopConfiguration().set("fs.azure.account.auth.type.adl_storage_account.dfs.core.windows.net", "OAuth")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.azure.account.oauth.provider.type.adl_storage_account.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
# Set the other required properties...

Replace the placeholders with the actual values you need. Note that you can also set Hadoop properties when submitting a job by passing them with a `spark.hadoop.` prefix via the `--conf` parameter. A fuller sketch of this option follows the list below.
3. Creating a New Hadoop Configuration Instance: You mentioned that creating a new Hadoop configuration instance works as expected. While it’s not ideal to duplicate configurations, if this approach solves your problem, you can continue using it. Just be aware of the redundancy.
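Putting option 2 together with the listing code from your question, a minimal sketch could look like the following. This is only a sketch: it assumes your cluster access mode allows modifying the session’s Hadoop configuration at runtime, and it reuses the secret scope names from your post.

# Set the ABFS OAuth properties on the Hadoop configuration of the active session
hconf = spark.sparkContext._jsc.hadoopConfiguration()
account = "adl_storage_account"
hconf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
hconf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
          "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
hconf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net",
          dbutils.secrets.get("my_secret_scope", "client_id"))
hconf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net",
          dbutils.secrets.get("my_secret_scope", "client_secret"))
hconf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
          dbutils.secrets.get("my_secret_scope", "client_endpoint"))

# The listing code from the question can then use the session configuration directly
path = spark._jvm.org.apache.hadoop.fs.Path(f"abfss://adl_container@{account}.dfs.core.windows.net/")
fs = path.getFileSystem(hconf)
for status in fs.listStatus(path):
    print(status.getPath())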
Remember that the `dbutils.fs` library is another option for interacting with storage, but I understand you have your reasons for not using it. Feel free to explore the above approaches, and let me know if you need further assistance!

PS: It’s great that you’re already aware of the `dbutils.fs` library and its capabilities!

If you have any more questions or need additional help, feel free to ask.
06-06-2024 10:10 AM - edited 06-06-2024 10:27 AM
Hello @Kaniz_Fatma,
Thank you, `spark.sparkContext.getConf().getAll()` gets the configuration that I need. But it also includes configuration entries that are not related to Hadoop. In that case, I assume there is no better way to get the complete Hadoop configuration.
The `Path.getFileSystem(hadoop_conf)` method requires a `spark._jvm.Configuration` instance, so I will still have to create a new `spark._jvm.Configuration()`, filter the relevant entries out of the full Spark configuration, fill the `Configuration` instance with them, and pass that to the method. But at least I have a working method now 🙂
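Roughly what I have in mind is sketched below. Filtering on the `spark.hadoop.` and `fs.azure.` prefixes is just my assumption about which entries are relevant:

from py4j.java_gateway import java_import
java_import(spark._jvm, 'org.apache.hadoop.conf.Configuration')

# Copy the ADLS-related entries from the Spark configuration into a fresh Hadoop Configuration
hadoop_conf = spark._jvm.Configuration()
for key, value in spark.sparkContext.getConf().getAll():
    if key.startswith("spark.hadoop."):
        # Spark exposes Hadoop settings with a "spark.hadoop." prefix; strip it
        hadoop_conf.set(key[len("spark.hadoop."):], value)
    elif key.startswith("fs.azure."):
        hadoop_conf.set(key, value)

# Pass the filtered configuration to getFileSystem, as in my first snippet
adl_uri = "abfss://adl_container@adl_storage_account.dfs.core.windows.net/"
dir_path = spark._jvm.org.apache.hadoop.fs.Path(adl_uri)
fs = dir_path.getFileSystem(hadoop_conf)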
Thank you very much for your help!
06-06-2024 10:28 AM
By the way, how do you achieve inline code highlighting in the editor? 🙂 I tried `` but it didn't work.