Error with Read XML data using the spark-xml library

citizenX7042 — Mon, 20 Jan 2025 10:48:49 GMT

Hi, I would appreciate any help with an error I get when loading an XML file with the spark-xml library.

My environment:
Databricks Runtime 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)
Library: com.databricks:spark-xml_2.12:0.15.0
Running in a Databricks notebook.

When I run this script:

from pyspark.sql.functions import regexp_extract, input_file_name

print(single_file)

# Load the single file
raw_df_single = (
    spark.read.format("com.databricks.spark.xml")  # XML format
    .option("rowTag", "Card")  # Specify the row tag for parsing
    .load(single_file)  # Load the single file
    .withColumn("@FileName", regexp_extract(input_file_name(), r"([^/]+)$", 1))  # Extract file name
)

# Show a preview of the data
raw_df_single.show()

I get this error:

Py4JJavaError: An error occurred while calling o621.load. : Failure to initialize configuration for storage account [REDACTED].dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.keyInvalid configuration value detected for fs.azure.account.key

The print of single_file shows: abfss://external-sources@[REDACTED].dfs.core.windows.net/***/***/*/testfile.xml

I checked, and a file exists at that path in the blob storage.

Can the library connect directly to the blob storage?
What is the path format for that, and what is the best practice?

Re: Error with Read XML data using the spark-xml library

Alberto_Umana — Mon, 20 Jan 2025 12:37:24 GMT

Hi @citizenX7042,

Since the error indicates an issue with the configuration value for fs.azure.account.key, can you test with the code below?

from pyspark.sql.functions import regexp_extract, input_file_name

# Set the storage account key
spark.conf.set("fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net", "<your-storage-account-key>")

# Define the file path
single_file = "abfss://external-sources@<your-storage-account-name>.dfs.core.windows.net/Bronze/Tribe_Report/20241210/visa-10079563/cards-11-15967860899208-10079563-20241210.xml"

# Load the single file
raw_df_single = (
    spark.read.format("com.databricks.spark.xml")  # XML format
    .option("rowTag", "Card")  # Specify the row tag for parsing
    .load(single_file)  # Load the single file
    .withColumn("@FileName", regexp_extract(input_file_name(), r"([^/]+)$", 1))  # Extract file name
)

# Show a preview of the data
raw_df_single.show()
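
Rather than pasting the account key into the notebook, the key can be read from a Databricks secret scope. This is a minimal sketch, not part of the original reply. It assumes a secret scope and secret name you have created yourself (`my-scope` and `storage-key` are hypothetical) and that `spark` and `dbutils` are the objects a Databricks notebook provides; both helper functions are illustrative names.

```python
def account_key_conf_name(storage_account: str) -> str:
    """Build the Spark conf name that holds an ADLS Gen2 account access key."""
    return f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"


def set_account_key(spark, dbutils, storage_account, secret_scope, secret_name):
    """Read the access key from a secret scope and set it on the Spark session.

    secret_scope and secret_name are hypothetical; use the names you created.
    Returns the conf name that was set, which is handy for debugging.
    """
    key = dbutils.secrets.get(scope=secret_scope, key=secret_name)
    conf_name = account_key_conf_name(storage_account)
    spark.conf.set(conf_name, key)
    return conf_name
```

In a notebook this would be called as set_account_key(spark, dbutils, "<your-storage-account-name>", "my-scope", "storage-key") before the load.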

Re: Error with Read XML data using the spark-xml library

Alberto_Umana — Mon, 20 Jan 2025 12:39:20 GMT

Please refer to: https://learn.microsoft.com/en-us/azure/databricks/connect/storage/azure-storage

Re: Error with Read XML data using the spark-xml library

barsha_sharma — Wed, 29 Jan 2025 09:37:33 GMT

Hi @Alberto_Umana, I am facing the same issue. It works when I read the XML file as text using spark.read.text(), but fails when I try to read it in XML format. I'm authenticating with a service principal (SPN), and the config is correct: I'm able to read JSON files from the same folder, and also the XML file as text, as mentioned.

It also works if I use the mounted path to the file, but not when I use the abfss path.

Could it be that the spark-xml library cannot work directly with abfss?

I have the following installed on my cluster: com.databricks:spark-xml_2.12:0.15.0
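
For reference, the service principal setup mentioned above usually comes down to five OAuth settings per storage account. The sketch below only builds those settings as a dictionary; the function names and the `prefix` parameter are illustrative assumptions, and the client ID, secret, and tenant ID are placeholders you supply.

```python
def spn_oauth_conf(storage_account, client_id, client_secret, tenant_id, prefix=""):
    """Build the ABFS OAuth settings for a service principal.

    prefix is an illustrative parameter: leave it empty for spark.conf.set,
    or pass "spark.hadoop." to produce keys for the cluster Spark config.
    """
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"{prefix}fs.azure.account.auth.type.{suffix}": "OAuth",
        f"{prefix}fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"{prefix}fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"{prefix}fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"{prefix}fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def apply_session_conf(spark, conf):
    """Apply each setting to the current Spark session."""
    for name, value in conf.items():
        spark.conf.set(name, value)
```

Whether spark-xml picks up session-level settings is not confirmed in this thread. If it does not, one thing to try is calling spn_oauth_conf(..., prefix="spark.hadoop.") and entering the resulting key/value pairs in the cluster's Spark config instead of calling spark.conf.set.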

Re: Error with Read XML data using the spark-xml library

barsha_sharma — Tue, 18 Feb 2025 10:52:25 GMT

UPDATE:

It is now possible to read XML files directly: https://docs.databricks.com/en/query/formats/xml.html

Make sure to update your Databricks Runtime to 14.3 or above, and remove the spark-xml Maven library from your cluster.
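
A minimal sketch of what the read might look like with the built-in reader, assuming Databricks Runtime 14.3 or above with the spark-xml library removed. The helper name `read_xml_native` is illustrative; the "Card" row tag comes from the original question.

```python
def read_xml_native(spark, path, row_tag):
    """Read an XML file with the built-in reader instead of spark-xml."""
    return (
        spark.read.format("xml")    # built-in format name, not "com.databricks.spark.xml"
        .option("rowTag", row_tag)  # element that maps to one row
        .load(path)
    )
```

In a notebook this would be called as read_xml_native(spark, single_file, "Card").show(), with single_file being the abfss path from the original post.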