Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Error with Read XML data using the spark-xml library

citizenX7042
New Contributor

Hi, I would appreciate any help with an error when loading an XML file using the spark-xml library.

My environment:
Runtime: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)
Library: com.databricks:spark-xml_2.12:0.15.0
Running in a Databricks notebook.

When running this script:

from pyspark.sql.functions import regexp_extract, input_file_name
print(single_file)
# Load the single file
raw_df_single = (
    spark.read.format("com.databricks.spark.xml")  # XML format
    .option("rowTag", "Card")                     # Specify the row tag for parsing
    .load(single_file)                            # Load the single file
    .withColumn("@FileName", regexp_extract(input_file_name(), r"([^/]+)$", 1))  # Extract file name
)

# Show a preview of the data
raw_df_single.show()
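(For reference, the `([^/]+)$` pattern passed to `regexp_extract` above captures everything after the last slash, i.e. the file name. A quick plain-Python sanity check of the same pattern, outside Spark, with a made-up path:)

```python
import re

# Same pattern as in the regexp_extract call above: capture the
# trailing path segment (everything after the last "/").
pattern = r"([^/]+)$"

# Hypothetical path for illustration only.
path = "abfss://container@account.dfs.core.windows.net/a/b/testfile.xml"
match = re.search(pattern, path)
print(match.group(1))  # -> testfile.xml
```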
I get this error:

Py4JJavaError: An error occurred while calling o621.load. : Failure to initialize configuration for storage account [REDACTED].dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.key
 
The printed value of single_file: abfss://external-sources@[REDACTED].dfs.core.windows.net/***/***/*/testfile.xml

I have verified that a file exists at that path in the blob storage.

Can the library connect directly to the blob storage?
What is the correct format for that, and what is the best practice?


3 REPLIES

Alberto_Umana
Databricks Employee

Hi @citizenX7042,

The error indicates an issue with the configuration value for fs.azure.account.key.

Can you test with the code below?

 

from pyspark.sql.functions import regexp_extract, input_file_name

# Set the storage account key
spark.conf.set("fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net", "<your-storage-account-key>")

# Define the file path
single_file = "abfss://external-sources@<your-storage-account-name>.dfs.core.windows.net/Bronze/Tribe_Report/20241210/visa-10079563/cards-11-15967860899208-10079563-20241210.xml"

# Load the single file
raw_df_single = (
    spark.read.format("com.databricks.spark.xml")  # XML format
    .option("rowTag", "Card")                     # Specify the row tag for parsing
    .load(single_file)                            # Load the single file
    .withColumn("@FileName", regexp_extract(input_file_name(), r"([^/]+)$", 1))  # Extract file name
)

# Show a preview of the data
raw_df_single.show()
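(A hedged note on why the session-level key alone may not be enough: spark-xml lists and reads files through Hadoop's FileSystem API, and in some setups it does not pick up credentials set only via session-level `spark.conf.set`. A commonly suggested workaround is to set the same key on the SparkContext's Hadoop configuration, or as a cluster-level `spark.hadoop.*` Spark config. This is a sketch under that assumption; the account name and key are placeholders.)

```python
# Workaround sketch (assumption: spark-xml resolves credentials via the
# Hadoop configuration rather than the session config). Placeholders only.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
    "<your-storage-account-key>",
)

# Equivalent cluster-level setting (cluster > Advanced options > Spark config):
# spark.hadoop.fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net <your-storage-account-key>
```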


barsha_sharma
New Contributor II

Hi @Alberto_Umana, I am facing the same issue. It works when I try to read the XML file as text using spark.read.text(), but fails when I try to read it in XML format. I'm authenticating using a service principal (SPN), and the config is correct, since I'm able to read JSON files from the same folder, and also the XML file as text, as mentioned.

Also, it works if I use the mounted path to the file, but not when I use the abfss path.

Could it be an issue with the spark-xml library not being able to work directly with abfss?

I have the following installed in my cluster: com.databricks:spark-xml_2.12:0.15.0
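(One alternative worth testing, offered as a suggestion rather than a confirmed fix: Databricks Runtime 14.3 LTS includes native XML file format support, so the separate spark-xml library may not be needed at all. The built-in reader goes through the standard DataFrame source path, which honors session-level SPN/OAuth credentials. A minimal sketch, assuming DBR 14.3+ and using a placeholder path:)

```python
# Sketch using the built-in XML reader (assumption: cluster runs
# Databricks Runtime 14.3 LTS or later). Path is a placeholder.
df = (
    spark.read.format("xml")
    .option("rowTag", "Card")
    .load("abfss://external-sources@<storage-account>.dfs.core.windows.net/path/to/testfile.xml")
)
df.show()
```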
