<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Error with Read XML data using the spark-xml library in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/error-with-read-xml-data-using-the-spark-xml-library/m-p/106297#M9839</link>
    <description>&lt;P&gt;Hi, I would appreciate any help with an error I get when loading an XML file with the spark-xml library.&lt;BR /&gt;&lt;BR /&gt;My environment:&lt;BR /&gt;14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)&lt;BR /&gt;Library: &lt;A target="_blank"&gt;com.databricks:spark-xml_2.12:0.15.0&lt;/A&gt;&lt;BR /&gt;Running in a Databricks notebook.&lt;BR /&gt;&lt;BR /&gt;When I run this script:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;from pyspark.sql.functions import regexp_extract, input_file_name&lt;/DIV&gt;&lt;DIV&gt;print(single_file)&lt;/DIV&gt;&lt;DIV&gt;# Load the single file&lt;/DIV&gt;&lt;DIV&gt;raw_df_single = (&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; spark.read.format("com.databricks.spark.xml")&amp;nbsp; # XML format&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; .option("rowTag", "Card")&amp;nbsp; # Specify the row tag for parsing&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; .load(single_file)&amp;nbsp; # Load the single file&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; .withColumn("@FileName", regexp_extract(input_file_name(), r"([^/]+)$", 1))&amp;nbsp; # Extract file name&lt;/DIV&gt;&lt;DIV&gt;)&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;# Show a preview of the data&lt;/DIV&gt;&lt;DIV&gt;raw_df_single.show()&lt;/DIV&gt;&lt;DIV&gt;I get this error:&lt;BR /&gt;&lt;BR /&gt;&lt;/DIV&gt;&lt;DIV&gt;Py4JJavaError: An error occurred while calling o621.load.
: Failure to initialize configuration for storage account [REDACTED].dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.keyInvalid configuration value detected for fs.azure.account.key&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;The printed value of single_file is:&amp;nbsp;&lt;A class="" href="abfss://external-sources@[REDACTED].dfs.core.windows.net/Bronze/Tribe_Report/20241210/visa-10079563/cards-11-15967860899208-10079563-20241210.xml" target="_blank" rel="noopener noreferrer"&gt;abfss://external-sources@[REDACTED].dfs.core.windows.net/***/***/*/testfile.xml&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;I checked, and a file with that path exists in the blob storage.&lt;BR /&gt;&lt;BR /&gt;Can the library connect directly to the blob storage?&lt;BR /&gt;If so, what path format should I use, and what is the best practice?&lt;/DIV&gt;&lt;/DIV&gt;</description>
    <pubDate>Mon, 20 Jan 2025 10:48:49 GMT</pubDate>
    <dc:creator>citizenX7042</dc:creator>
    <dc:date>2025-01-20T10:48:49Z</dc:date>
    <item>
      <title>Error with Read XML data using the spark-xml library</title>
      <link>https://community.databricks.com/t5/get-started-discussions/error-with-read-xml-data-using-the-spark-xml-library/m-p/106297#M9839</link>
      <description>&lt;P&gt;Hi, I would appreciate any help with an error I get when loading an XML file with the spark-xml library.&lt;BR /&gt;&lt;BR /&gt;My environment:&lt;BR /&gt;14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)&lt;BR /&gt;Library: &lt;A target="_blank"&gt;com.databricks:spark-xml_2.12:0.15.0&lt;/A&gt;&lt;BR /&gt;Running in a Databricks notebook.&lt;BR /&gt;&lt;BR /&gt;When I run this script:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;from pyspark.sql.functions import regexp_extract, input_file_name&lt;/DIV&gt;&lt;DIV&gt;print(single_file)&lt;/DIV&gt;&lt;DIV&gt;# Load the single file&lt;/DIV&gt;&lt;DIV&gt;raw_df_single = (&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; spark.read.format("com.databricks.spark.xml")&amp;nbsp; # XML format&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; .option("rowTag", "Card")&amp;nbsp; # Specify the row tag for parsing&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; .load(single_file)&amp;nbsp; # Load the single file&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; .withColumn("@FileName", regexp_extract(input_file_name(), r"([^/]+)$", 1))&amp;nbsp; # Extract file name&lt;/DIV&gt;&lt;DIV&gt;)&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;# Show a preview of the data&lt;/DIV&gt;&lt;DIV&gt;raw_df_single.show()&lt;/DIV&gt;&lt;DIV&gt;I get this error:&lt;BR /&gt;&lt;BR /&gt;&lt;/DIV&gt;&lt;DIV&gt;Py4JJavaError: An error occurred while calling o621.load.
: Failure to initialize configuration for storage account [REDACTED].dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.keyInvalid configuration value detected for fs.azure.account.key&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;The printed value of single_file is:&amp;nbsp;&lt;A class="" href="abfss://external-sources@[REDACTED].dfs.core.windows.net/Bronze/Tribe_Report/20241210/visa-10079563/cards-11-15967860899208-10079563-20241210.xml" target="_blank" rel="noopener noreferrer"&gt;abfss://external-sources@[REDACTED].dfs.core.windows.net/***/***/*/testfile.xml&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;I checked, and a file with that path exists in the blob storage.&lt;BR /&gt;&lt;BR /&gt;Can the library connect directly to the blob storage?&lt;BR /&gt;If so, what path format should I use, and what is the best practice?&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Mon, 20 Jan 2025 10:48:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/error-with-read-xml-data-using-the-spark-xml-library/m-p/106297#M9839</guid>
      <dc:creator>citizenX7042</dc:creator>
      <dc:date>2025-01-20T10:48:49Z</dc:date>
    </item>
    <item>
      <title>Re: Error with Read XML data using the spark-xml library</title>
      <link>https://community.databricks.com/t5/get-started-discussions/error-with-read-xml-data-using-the-spark-xml-library/m-p/106308#M9840</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/144685"&gt;@citizenX7042&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;The error points to the configuration value for &lt;CODE&gt;fs.azure.account.key&lt;/CODE&gt;, which suggests the storage account key is not set correctly for this session.&lt;/P&gt;
&lt;P class="p1"&gt;Can you test with the code below?&lt;/P&gt;
&lt;P class="p1"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="p1"&gt;from pyspark.sql.functions import regexp_extract, input_file_name&lt;/P&gt;
&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="p1"&gt;# Set the storage account key&lt;/P&gt;
&lt;P class="p1"&gt;spark.conf.set("fs.azure.account.key.&amp;lt;your-storage-account-name&amp;gt;.dfs.core.windows.net", "&amp;lt;your-storage-account-key&amp;gt;")&lt;/P&gt;
&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="p1"&gt;# Define the file path&lt;/P&gt;
&lt;P class="p1"&gt;single_file = "abfss://external-sources@&amp;lt;your-storage-account-name&amp;gt;.dfs.core.windows.net/Bronze/Tribe_Report/20241210/visa-10079563/cards-11-15967860899208-10079563-20241210.xml"&lt;/P&gt;
&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="p1"&gt;# Load the single file&lt;/P&gt;
&lt;P class="p1"&gt;raw_df_single = (&lt;/P&gt;
&lt;P class="p1"&gt;&amp;nbsp; &amp;nbsp; spark.read.format("com.databricks.spark.xml")&amp;nbsp; # XML format&lt;/P&gt;
&lt;P class="p1"&gt;&amp;nbsp; &amp;nbsp; .option("rowTag", "Card")&amp;nbsp; # Specify the row tag for parsing&lt;/P&gt;
&lt;P class="p1"&gt;&amp;nbsp; &amp;nbsp; .load(single_file)&amp;nbsp; # Load the single file&lt;/P&gt;
&lt;P class="p1"&gt;&amp;nbsp; &amp;nbsp; .withColumn("@FileName", regexp_extract(input_file_name(), r"([^/]+)$", 1))&amp;nbsp; # Extract file name&lt;/P&gt;
&lt;P class="p1"&gt;)&lt;/P&gt;
&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="p1"&gt;# Show a preview of the data&lt;/P&gt;
&lt;P class="p1"&gt;raw_df_single.show()&lt;/P&gt;</description>
      <pubDate>Mon, 20 Jan 2025 12:37:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/error-with-read-xml-data-using-the-spark-xml-library/m-p/106308#M9840</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2025-01-20T12:37:24Z</dc:date>
    </item>
    <item>
      <title>Re: Error with Read XML data using the spark-xml library</title>
      <link>https://community.databricks.com/t5/get-started-discussions/error-with-read-xml-data-using-the-spark-xml-library/m-p/106309#M9841</link>
      <description>&lt;P&gt;Please refer to:&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/connect/storage/azure-storage" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/connect/storage/azure-storage&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 20 Jan 2025 12:39:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/error-with-read-xml-data-using-the-spark-xml-library/m-p/106309#M9841</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2025-01-20T12:39:20Z</dc:date>
    </item>
    <item>
      <title>Re: Error with Read XML data using the spark-xml library</title>
      <link>https://community.databricks.com/t5/get-started-discussions/error-with-read-xml-data-using-the-spark-xml-library/m-p/107532#M9842</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/106294"&gt;@Alberto_Umana&lt;/a&gt;, I am facing the same issue. Reading the XML file as text with spark.read.text() works, but reading it in XML format fails. I'm authenticating with a service principal (SPN), and the configuration is correct: I can read JSON files from the same folder, and I can read the XML file as text, as mentioned.&lt;/P&gt;&lt;P&gt;It also works if I use the mounted path to the file, but not when I use the abfss path.&lt;BR /&gt;&lt;BR /&gt;Could the issue be that the spark-xml library cannot work directly with abfss?&lt;BR /&gt;&lt;BR /&gt;I have the following library installed on my cluster:&amp;nbsp;&lt;A target="_blank"&gt;com.databricks:spark-xml_2.12:0.15.0&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Jan 2025 09:37:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/error-with-read-xml-data-using-the-spark-xml-library/m-p/107532#M9842</guid>
      <dc:creator>barsha_sharma</dc:creator>
      <dc:date>2025-01-29T09:37:33Z</dc:date>
    </item>
    <item>
      <title>Re: Error with Read XML data using the spark-xml library</title>
      <link>https://community.databricks.com/t5/get-started-discussions/error-with-read-xml-data-using-the-spark-xml-library/m-p/110468#M9843</link>
      <description>&lt;P&gt;UPDATE:&lt;BR /&gt;&lt;BR /&gt;It is now possible to read XML files directly:&amp;nbsp;&lt;A href="https://docs.databricks.com/en/query/formats/xml.html" target="_blank" rel="noopener"&gt;https://docs.databricks.com/en/query/formats/xml.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Make sure your Databricks Runtime is version 14.3 or above, and remove the spark-xml Maven library from your cluster.&lt;/P&gt;</description>
      <pubDate>Tue, 18 Feb 2025 10:52:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/error-with-read-xml-data-using-the-spark-xml-library/m-p/110468#M9843</guid>
      <dc:creator>barsha_sharma</dc:creator>
      <dc:date>2025-02-18T10:52:25Z</dc:date>
    </item>
  </channel>
</rss>

