Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Error with Read XML data using the spark-xml library

citizenX7042
New Contributor

Hi, I would appreciate any help with an error when loading an XML file using the spark-xml library.

My environment:
Runtime: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)
Library: com.databricks:spark-xml_2.12:0.15.0
Running in a Databricks notebook.

When running this script:

from pyspark.sql.functions import regexp_extract, input_file_name
print(single_file)
# Load the single file
raw_df_single = (
    spark.read.format("com.databricks.spark.xml")  # XML format
    .option("rowTag", "Card")                     # Specify the row tag for parsing
    .load(single_file)                            # Load the single file
    .withColumn("@FileName", regexp_extract(input_file_name(), r"([^/]+)$", 1))  # Extract file name
)

# Show a preview of the data
raw_df_single.show()
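(For reference, the `([^/]+)$` pattern passed to `regexp_extract` above captures everything after the last slash, i.e. the file name. A quick plain-Python sanity check of the same pattern, outside Spark, with a made-up path:)

```python
import re

# Same pattern as in the regexp_extract call above: capture the
# trailing path segment (everything after the last "/").
pattern = r"([^/]+)$"

# Hypothetical path for illustration only.
path = "abfss://container@account.dfs.core.windows.net/a/b/testfile.xml"
match = re.search(pattern, path)
print(match.group(1))  # -> testfile.xml
```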
I get this error:

Py4JJavaError: An error occurred while calling o621.load. : Failure to initialize configuration for storage account [REDACTED].dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.key
 
The printed value of single_file: abfss://external-sources@[REDACTED].dfs.core.windows.net/***/***/*/testfile.xml

I have verified that a file exists at that path in the blob storage.

Can the library connect directly to the blob storage?
What is the correct format for that, and what is the best practice?


3 REPLIES

Alberto_Umana
Databricks Employee

Hi @citizenX7042,

The error indicates an issue with the configuration value for fs.azure.account.key.

Can you test with the code below?

 

from pyspark.sql.functions import regexp_extract, input_file_name

# Set the storage account key
spark.conf.set("fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net", "<your-storage-account-key>")

# Define the file path
single_file = "abfss://external-sources@<your-storage-account-name>.dfs.core.windows.net/Bronze/Tribe_Report/20241210/visa-10079563/cards-11-15967860899208-10079563-20241210.xml"

# Load the single file
raw_df_single = (
    spark.read.format("com.databricks.spark.xml")  # XML format
    .option("rowTag", "Card")                     # Specify the row tag for parsing
    .load(single_file)                            # Load the single file
    .withColumn("@FileName", regexp_extract(input_file_name(), r"([^/]+)$", 1))  # Extract file name
)

# Show a preview of the data
raw_df_single.show()
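(A hedged note on why the session-level key alone may not be enough: spark-xml lists and reads files through Hadoop's FileSystem API, and in some setups it does not pick up credentials set only via session-level `spark.conf.set`. A commonly suggested workaround is to set the same key on the SparkContext's Hadoop configuration, or as a cluster-level `spark.hadoop.*` Spark config. This is a sketch under that assumption; the account name and key are placeholders.)

```python
# Workaround sketch (assumption: spark-xml resolves credentials via the
# Hadoop configuration rather than the session config). Placeholders only.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
    "<your-storage-account-key>",
)

# Equivalent cluster-level setting (cluster > Advanced options > Spark config):
# spark.hadoop.fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net <your-storage-account-key>
```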


barsha_sharma
New Contributor II

Hi @Alberto_Umana, I am facing the same issue. It works when I try to read the XML file as text using spark.read.text(), but fails when I try to read it in XML format. I'm authenticating using a service principal (SPN), and the config is correct, since I'm able to read JSON files from the same folder, and also the XML file as text, as mentioned.

Also, it works if I use the mounted path to the file, but not when I use the abfss path.

Could it be an issue with the spark-xml library not being able to work directly with abfss?

I have the following installed in my cluster: com.databricks:spark-xml_2.12:0.15.0
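(One alternative worth testing, offered as a suggestion rather than a confirmed fix: Databricks Runtime 14.3 LTS includes native XML file format support, so the separate spark-xml library may not be needed at all. The built-in reader goes through the standard DataFrame source path, which honors session-level SPN/OAuth credentials. A minimal sketch, assuming DBR 14.3+ and using a placeholder path:)

```python
# Sketch using the built-in XML reader (assumption: cluster runs
# Databricks Runtime 14.3 LTS or later). Path is a placeholder.
df = (
    spark.read.format("xml")
    .option("rowTag", "Card")
    .load("abfss://external-sources@<storage-account>.dfs.core.windows.net/path/to/testfile.xml")
)
df.show()
```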
