
Empty XML tag

YevheniiY
New Contributor II

 

<ItemMaintenance>
	<Batch>
		<BathInfo>info</BathInfo>
		<Item attr1="tekst" attr2="Tekst2">
			<ItemId type="Type" id="id"/>
			<Dates>
				<Start>2023-11-09</Start>
				<End>2024-01-02</End>
			</Dates>
			<MoreData>
			More data
			</MoreData>
		</Item>
	</Batch>
</ItemMaintenance>

 

Hello, I'm facing a problem reading an XML file with the new XML Auto Loader feature. With a structure like the one above and rowTag set to Item, the file is not read correctly: the DataFrame gets only the Item attributes and the first child element, ItemId, and then reading stops, even though the Item tag is not closed yet and a lot of data follows. There are no errors and nothing indicates an incorrect operation. I tried different read options, removed the attributes, and so on. I followed the native documentation: https://docs.databricks.com/en/_extras/documents/native-xml-private-preview.pdf
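
For context, a minimal sketch of the kind of read I'm attempting (in Python; the input and schema paths are placeholders, not my real locations):

# A sketch of the Auto Loader XML read described above; paths are placeholders.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "xml")
    .option("rowTag", "Item")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/item_maintenance")
    .load("/path/to/xml/input/"))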


Kaniz
Community Manager

Hi @YevheniiY

I understand that you’re encountering issues reading an XML file with the new XML Auto Loader feature. Let’s troubleshoot this together.

  1. Spark-XML Library Installation:

    • First, ensure that you have installed the Spark-XML OSS library on your Databricks cluster, and that the Spark-XML version is compatible with the Spark version running on the cluster.
    • You can review the installation steps in the official documentation.
  2. Create the XML File:

    • Create an XML file with a structure similar to your data. For example:
    val xmlData =
      """
      <people>
        <person>
          <age born="1990-02-24">25</age>
        </person>
        <person>
          <age born="1985-01-01">30</age>
        </person>
        <person>
          <age born="1980-01-01">30</age>
        </person>
      </people>
      """
    dbutils.fs.put("/<path-to-save-xml-file>/<name-of-file>.xml", xmlData)
    
  3. Define Imports and UDF:

    • Import the necessary functions and define a user-defined function (UDF) to convert the binary content to a string (required because the streaming source reads each file as binary data).
    import com.databricks.spark.xml.functions.from_xml
    import com.databricks.spark.xml.schema_of_xml
    import org.apache.spark.sql.functions.{input_file_name, udf}
    
    val toStrUDF = udf((bytes: Array[Byte]) => new String(bytes, "UTF-8"))
    
  4. Extract XML Schema:

    • Extract the XML schema using the schema_of_xml method from Spark-XML, inferring it from the file contents loaded as binary data.
    val dfSchema = spark.read.format("binaryFile")
      .load("/FileStore/tables/test/xml/data/age/")
      .select(toStrUDF($"content").alias("text"))
    
    val payloadSchema = schema_of_xml(dfSchema.select("text").as[String])
    
  5. Implement the Stream Reader:

    • Now you can create the streaming DataFrame using Auto Loader:
    val streamingDF = spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.useNotifications", "false") // Using listing mode
      .option("cloudFiles.format", "binaryFile")
      .load("/FileStore/tables/test/xml/data/age/")
      .select(toStrUDF($"content").alias("text"))
      .select(from_xml($"text", payloadSchema).alias("parsed"))
      .withColumn("path", input_file_name)
    
  6. Additional Considerations:

    • If you need to scale up your application, consider using file notification mode (cloudFiles.useNotifications set to true) instead of listing mode; see the sketch after this reply.
    • Ensure that your XML file adheres to the expected structure and that the rowTag is correctly specified.

Remember to replace placeholders like <path-to-save-xml-file> and <name-of-file> with actual paths and filenames relevant to your use case. If you encounter any specific errors or issues, feel free to share more details, and we can continue troubleshooting! 😊
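
Regarding point 6, here is a rough Python sketch of the same Auto Loader source switched to file notification mode; the input path reuses the sample path above, and the cloud-side notification permissions are assumed to be configured:

# The same Auto Loader source with file notification mode enabled instead of
# directory listing; assumes cloud notification setup is in place.
streaming_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")
    .option("cloudFiles.useNotifications", "true")
    .load("/FileStore/tables/test/xml/data/age/"))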

 

YevheniiY
New Contributor II

@Kaniz Hi! I apologise for the late reply! 😅 I'm using the 14.3 LTS cluster (includes Apache Spark 3.5.0, Scala 2.12) and the Python programming language. When I read files that do not contain empty tags (i.e. self-closing "merged" tags with no separate opening and closing tag, like the ItemId tag in my first post), the code works marvellously, but as soon as the reader encounters an empty tag, reading stops. If the document consists only of "normal" tags with an opening and a closing tag, reading works fine! 🙂

And it's marvellous and makes it very easy to work with files in Databricks, but empty tags don't work for me 😅

df = spark.read.format('xml').options(rowTag='Item').load(test_file_location)
df.display()

Example of normal XML

  <people>
    <person>
      <age born="1990-02-24">25</age>
    </person>
    <person>
      <age born="1985-01-01">30</age>
    </person>
    <person>
      <age born="1980-01-01">30</age>
    </person>
  </people>

Example of empty tag

  <people>
    <person>
      <age_t born="1990-02-24"/>
      <age born="1990-02-24">25</age>
    </person>
    <person>
      <age_t born="1985-01-01"/>
      <age born="1985-01-01">30</age>
    </person>
    <person>
     <age_t born="1980-01-01"/>
      <age born="1980-01-01">30</age>
    </person>
  </people>
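
For reference, a minimal reproduction sketch (the DBFS path below is a placeholder):

# Write the sample containing the self-closing age_t tag and read it back.
empty_tag_xml = """<people>
  <person>
    <age_t born="1990-02-24"/>
    <age born="1990-02-24">25</age>
  </person>
</people>"""

dbutils.fs.put("/tmp/test/empty_tag.xml", empty_tag_xml, True)

# With the self-closing age_t tag present the read stops early for me;
# removing age_t makes the same read return every row as expected.
df = spark.read.format("xml").options(rowTag="person").load("/tmp/test/empty_tag.xml")
df.display()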

 
