
Accessing ADLS Gen2 related Hadoop configuration in notebook

Ender
New Contributor II

I have a cluster configured with the settings required to access an ADLS Gen2 storage account, and access works without any problems.
[Screenshot: cluster Spark configuration with the ADLS Gen2 access settings]

I want to access this storage using the Hadoop FileSystem APIs. To do this, I am trying to obtain the Hadoop configuration from the active Spark session in a notebook. See the code below.

from py4j.java_gateway import java_import

# Make the Hadoop filesystem classes available through the Py4J gateway.
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
java_import(spark._jvm, 'org.apache.hadoop.fs.FileSystem')

# Reuse the Hadoop configuration of the active Spark session.
hadoop_conf = spark._jsc.hadoopConfiguration()

adl_uri = "abfss://adl_container@adl_storage_account.dfs.core.windows.net/"
dir_path = spark._jvm.Path(adl_uri)
fs = dir_path.getFileSystem(hadoop_conf)

# List the directory contents through the FileSystem API.
files = fs.listStatus(dir_path)
print(f"Paths in directory {adl_uri}:")
for file in files:
    print(file.getPath())

When I run this code, I get the error below.

Py4JJavaError: An error occurred while calling o1131.getFileSystem. : Failure to initialize configuration for storage account adl_storage_account.dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.key

So, I checked the related configuration value, and it seems its value is None.

[Screenshot: the fs.azure.account.key configuration value is None]

Given that the Hadoop configuration works as expected when Spark itself accesses the storage, the configuration is certainly set properly at the cluster level. So why can't I access it through `spark._jsc.hadoopConfiguration()`? Is there another way to access the Hadoop configuration, including the ADLS Gen2 connector settings?

PS: I know I can use the `dbutils.fs` library to interact with the storage. I have my reasons for not using it.

PS2: If I create a new Hadoop configuration instance and set the required configuration on it, the code above works as expected. But I do not want to create a new Hadoop configuration instance when I already have one with the required authentication info.

java_import(spark._jvm, 'org.apache.hadoop.conf.Configuration')

# Build a fresh Hadoop configuration and populate the ADLS Gen2 connector
# settings by hand from a secret scope.
hadoop_conf = spark._jvm.Configuration()
storage_account_name = "adl_storage_account"
suffix = f"{storage_account_name}.dfs.core.windows.net"

hadoop_conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
hadoop_conf.set(
    f"fs.azure.account.oauth.provider.type.{suffix}",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
)
hadoop_conf.set(
    f"fs.azure.account.oauth2.client.id.{suffix}",
    dbutils.secrets.get("my_secret_scope", "client_id")
)
hadoop_conf.set(
    f"fs.azure.account.oauth2.client.secret.{suffix}",
    dbutils.secrets.get("my_secret_scope", "client_secret")
)
hadoop_conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{suffix}",
    dbutils.secrets.get("my_secret_scope", "client_endpoint")
)
hadoop_conf.set(
    f"fs.azure.account.key.{suffix}",
    dbutils.secrets.get("my_secret_scope", "key")
)
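
For reference, a minimal sketch of how I then use this manually populated configuration, reusing the placeholder URI from the first snippet:

# Pass the hand-built configuration to the Hadoop FileSystem API.
adl_uri = "abfss://adl_container@adl_storage_account.dfs.core.windows.net/"
dir_path = spark._jvm.Path(adl_uri)
fs = dir_path.getFileSystem(hadoop_conf)
for file in fs.listStatus(dir_path):
    print(file.getPath())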

2 REPLIES

Ender
New Contributor II

Hello @Retired_mod,

Thank you, `spark.sparkContext.getConf().getAll()` gets the configuration that I need. But it also includes entries that are not related to Hadoop. In that case, I assume there is no better way to get the complete Hadoop configuration.

The `Path.getFileSystem(hadoop_conf)` method requires a `spark._jvm.Configuration` instance, so I will have to create a new `spark._jvm.Configuration()`, filter the desired entries manually from the entire Spark configuration dictionary, fill the `Configuration` instance with them, and pass it to the method. But at least I have a working method now 🙂
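
For anyone who finds this later, a minimal sketch of that approach. I am assuming the cluster supplies the connector settings as Spark config entries with the `spark.hadoop.` prefix, which Spark forwards into the Hadoop configuration; the URI is the same placeholder as above:

from py4j.java_gateway import java_import

java_import(spark._jvm, 'org.apache.hadoop.conf.Configuration')
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')

# Copy only the Hadoop-related entries (keys prefixed with "spark.hadoop.")
# from the Spark configuration into a fresh Hadoop Configuration instance.
prefix = "spark.hadoop."
hadoop_conf = spark._jvm.Configuration()
for key, value in spark.sparkContext.getConf().getAll():
    if key.startswith(prefix):
        hadoop_conf.set(key[len(prefix):], value)

# Hand the rebuilt configuration to the Hadoop FileSystem API as before.
dir_path = spark._jvm.Path("abfss://adl_container@adl_storage_account.dfs.core.windows.net/")
fs = dir_path.getFileSystem(hadoop_conf)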

Thank you very much for your help!

Ender
New Contributor II

By the way, how do you achieve inline code highlighting in the editor? 🙂 I tried `` but it didn't work.
