06-02-2024 07:03 AM - edited 06-02-2024 07:16 AM
I have a cluster in which I have the required configuration to access an ADLS Gen2, and it works without any problems.
I want to access this storage using the Hadoop filesystem APIs. To achieve this, I am trying to get the Hadoop configuration from the active Spark session in a notebook. Please see the code below.
from pyspark.sql import SparkSession
from py4j.java_gateway import java_import

# Import the Hadoop filesystem classes into the Py4J gateway
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
java_import(spark._jvm, 'org.apache.hadoop.fs.FileSystem')

# Reuse the Hadoop configuration of the active Spark session
hadoop_conf = spark._jsc.hadoopConfiguration()

adl_uri = "abfss://adl_container@adl_storage_account.dfs.core.windows.net/"
dir_path = spark._jvm.Path(adl_uri)

# Resolve the filesystem for the ABFSS path and list its contents
fs = dir_path.getFileSystem(hadoop_conf)
files = fs.listStatus(dir_path)

print(f"Paths in directory {adl_uri}:")
for file in files:
    print(file.getPath())
When I run this code, I get the error below.
Py4JJavaError: An error occurred while calling o1131.getFileSystem. : Failure to initialize configuration for storage account adl_storage_account.dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.key
So, I checked the related configuration value, and it seems its value is None.
Given that the Hadoop configuration works as expected when Spark itself accesses the storage, the configuration is certainly set properly on the cluster. But why can't I access it through `spark._jsc.hadoopConfiguration()`? Is there another way to access the Hadoop configuration, including the ADLS Gen2 connector configuration?
PS: I know I can use the `dbutils.fs` library to interact with the storage. I have my reasons for not using it.
PS2: If I create a new Hadoop configuration instance and set the required configuration on it, the code above works as expected. But I do not want to create a new Hadoop configuration instance when I already have one with the required authentication info.
java_import(spark._jvm, 'org.apache.hadoop.conf.Configuration')

# Build a fresh Hadoop configuration and set the ADLS Gen2 (ABFS) auth properties on it
hadoop_conf = spark._jvm.Configuration()
storage_account_name = "adl_storage_account"

hadoop_conf.set(
    "fs.azure.account.auth.type.{0}.dfs.core.windows.net".format(storage_account_name),
    "OAuth"
)
hadoop_conf.set(
    "fs.azure.account.oauth.provider.type.{0}.dfs.core.windows.net".format(storage_account_name),
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
)
hadoop_conf.set(
    "fs.azure.account.oauth2.client.id.{0}.dfs.core.windows.net".format(storage_account_name),
    dbutils.secrets.get("my_secret_scope", "client_id")
)
hadoop_conf.set(
    "fs.azure.account.oauth2.client.secret.{0}.dfs.core.windows.net".format(storage_account_name),
    dbutils.secrets.get("my_secret_scope", "client_secret")
)
hadoop_conf.set(
    "fs.azure.account.oauth2.client.endpoint.{0}.dfs.core.windows.net".format(storage_account_name),
    dbutils.secrets.get("my_secret_scope", "client_endpoint")
)
hadoop_conf.set(
    "fs.azure.account.key.{0}.dfs.core.windows.net".format(storage_account_name),
    dbutils.secrets.get("my_secret_scope", "key")
)
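For reference, the manually built configuration is then used exactly like the session configuration in my first snippet; a minimal sketch with the same container URI:

# Resolve the filesystem with the manually populated configuration and list the directory
dir_path = spark._jvm.Path(adl_uri)  # adl_uri as defined in the first snippet
fs = dir_path.getFileSystem(hadoop_conf)
for file in fs.listStatus(dir_path):
    print(file.getPath())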
06-03-2024 03:30 AM - edited 06-03-2024 03:31 AM
Hi @Ender,
First, let’s address why you’re getting the error related to the `fs.azure.account.key` configuration value. The error message indicates that the configuration value for your ADLS Gen2 storage account is invalid. However, you’ve confirmed that the configuration works as expected when Spark accesses the storage. So, why can’t you access this configuration through `spark._jsc.hadoopConfiguration()`?

The reason lies in how Spark manages its configuration. When you use `spark._jsc.hadoopConfiguration()`, it provides access to the Hadoop configuration settings specific to Spark. However, it doesn’t include all the configuration values that you might expect, especially those related to external services like ADLS Gen2.

To access the complete Hadoop configuration, including the ADLS Gen2 connector configuration, you’ll need to take a different approach. Fortunately, there are ways to achieve this:
1. Using `spark.sparkContext.getConf().getAll()`: You can retrieve all the Spark configuration settings that have been set for the session using the following code snippet:

spark_conf = spark.sparkContext.getConf().getAll()

This will give you a list of key-value pairs with all the configured settings, including those related to Hadoop. Keep in mind that only values explicitly specified through `spark-defaults.conf`, `SparkConf`, or the command line will appear.
2. Setting Hadoop Configuration Properties: If you specifically want to set Hadoop configuration properties for your Spark session, you can set them on the session's Hadoop configuration (in PySpark this is reached through `spark.sparkContext._jsc.hadoopConfiguration()`). For example:

spark.sparkContext._jsc.hadoopConfiguration().set("fs.azure.account.auth.type.adl_storage_account.dfs.core.windows.net", "OAuth")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.azure.account.oauth.provider.type.adl_storage_account.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
# Set the other required properties...

Replace the placeholders with the actual values you need. Note that you can also set Hadoop properties when submitting a job by passing them with a `spark.hadoop.` prefix via the `--conf` parameter. A fuller sketch of this option follows the list below.
3. Creating a New Hadoop Configuration Instance: You mentioned that creating a new Hadoop configuration instance works as expected. While it’s not ideal to duplicate configurations, if this approach solves your problem, you can continue using it. Just be aware of the redundancy.
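Putting option 2 together with the listing code from your question, a minimal sketch could look like the following. This is only a sketch: it assumes your cluster access mode allows modifying the session’s Hadoop configuration at runtime, and it reuses the secret scope names from your post.

# Set the ABFS OAuth properties on the Hadoop configuration of the active session
hconf = spark.sparkContext._jsc.hadoopConfiguration()
account = "adl_storage_account"
hconf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
hconf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
          "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
hconf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net",
          dbutils.secrets.get("my_secret_scope", "client_id"))
hconf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net",
          dbutils.secrets.get("my_secret_scope", "client_secret"))
hconf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
          dbutils.secrets.get("my_secret_scope", "client_endpoint"))

# The listing code from the question can then use the session configuration directly
path = spark._jvm.org.apache.hadoop.fs.Path(f"abfss://adl_container@{account}.dfs.core.windows.net/")
fs = path.getFileSystem(hconf)
for status in fs.listStatus(path):
    print(status.getPath())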
Remember that the `dbutils.fs` library is another option for interacting with storage, but I understand you have your reasons for not using it. Feel free to explore the above approaches, and let me know if you need further assistance!

PS: It’s great that you’re already aware of the `dbutils.fs` library and its capabilities!

If you have any more questions or need additional help, feel free to ask.
06-06-2024 10:10 AM - edited 06-06-2024 10:27 AM
Hello @Kaniz_Fatma,
Thank you, `spark.sparkContext.getConf().getAll()` gets the configuration that I need. But it also includes configuration entries that are not related to Hadoop. In that case, I assume there is no better way to get the complete Hadoop configuration.
The `Path.getFileSystem(hadoop_conf)` method requires a `spark._jvm.Configuration` instance, so I will still have to create a new `spark._jvm.Configuration()`, filter the relevant entries out of the full Spark configuration, fill the `Configuration` instance with them, and pass that to the method. But at least I have a working method now 🙂
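Roughly what I have in mind is sketched below. Filtering on the `spark.hadoop.` and `fs.azure.` prefixes is just my assumption about which entries are relevant:

from py4j.java_gateway import java_import
java_import(spark._jvm, 'org.apache.hadoop.conf.Configuration')

# Copy the ADLS-related entries from the Spark configuration into a fresh Hadoop Configuration
hadoop_conf = spark._jvm.Configuration()
for key, value in spark.sparkContext.getConf().getAll():
    if key.startswith("spark.hadoop."):
        # Spark exposes Hadoop settings with a "spark.hadoop." prefix; strip it
        hadoop_conf.set(key[len("spark.hadoop."):], value)
    elif key.startswith("fs.azure."):
        hadoop_conf.set(key, value)

# Pass the filtered configuration to getFileSystem, as in my first snippet
adl_uri = "abfss://adl_container@adl_storage_account.dfs.core.windows.net/"
dir_path = spark._jvm.org.apache.hadoop.fs.Path(adl_uri)
fs = dir_path.getFileSystem(hadoop_conf)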
Thank you very much for your help!
06-06-2024 10:28 AM
By the way, how do you achieve inline code highlighting in the editor? 🙂 I tried `` but it didn't work.