cancel
Showing results for 
Search instead for 
Did you mean: 
Community Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.
cancel
Showing results for 
Search instead for 
Did you mean: 

Databricks XML - Bypassing rootTag and rowTag

RobsonNLPT
Contributor

I see the current conversion of dataframe to xml need to be improved.

My dataframe schema is a perfect nested schema based on structs but when I create a xml I have the follow issues:

1) I can't add elements to root

2) rootTag and rowTag are required

In the end I remove the first level of hierarchy (rowTag) using string methods or manually. The rowTag is already part of the dataframe nested schema so it doesn't make any sense

 

 

 

 

5 REPLIES 5

Kaniz_Fatma
Community Manager
Community Manager

Hi @RobsonNLPT, Converting DataFrames to XML in Databricks can be tricky, especially when dealing with nested schemas and specific XML requirements. 

 

Let’s address your issues:

 

Adding Elements to Root:

  • By default, the rootTag is required when writing a DataFrame to XML. However, if you want to add custom elements to the root, you can create a new DataFrame with the desired structure and then merge it with your original DataFrame.
  • Here’s an example using Scala:// Suppose your original DataFrame is 'df' val customRootDF = Seq(("customElement1", "value1"), ("customElement2", "value2"))  .toDF("elementName", "elementValue") val mergedDF = df.union(customRootDF) // Write mergedDF to XML mergedDF.write  .format("com.databricks.spark.xml")  .option("rootTag", "root")  .option("rowTag", "row")  .save("output.xml")
  • In this example, we create a new DataFrame (customRootDF) with custom elements and then merge it with the original DataFrame (df). The resulting XML will have the custom elements at the root level.

Removing the RowTag:

  • You’re right that the rowTag is often unnecessary when your DataFrame already has a nested structure. To avoid it, you can specify the desired nested structure directly.
  • Here’s how you can write your DataFrame without the rowTag:df.write  .format("com.databricks.spark.xml")  .option("rootTag", "root")  .option("rowTag", "") // Set rowTag to an empty string  .save("output.xml")
  • In this case, the resulting XML won’t have the unnecessary rowTag.

Adjust the column names and values according to your DataFrame schema. If you’re using PySpark, the process is similar—replace Scala syntax with Python.

 

Feel free to adapt these examples to your specific use case, and let me know if you need further assistance! 😊

Hi Kaniz. Willl test your suggestions but I think the documentation provided by Databricks / Spark  should include those relevant topics in depth. I've seen lots of posts on web regarding this topic.

Thank you

Hi Kaniz . I tested option("rowTag", "") using the library com.databricks:spark-xml_2.12:0.17.0 and also adb native format (runtime 14.3) but in both I got the error  "requirement failed: 'rowTag' option should not be empty string"..

 

sandip_a
New Contributor II
New Contributor II

Here is one of the ways to use the struct field name as rowTag:

 

 
import org.apache.spark.sql.types._
val schema = new StructType().add("Record",
  new StructType().add("age", IntegerType).add("name", StringType))
val data = Seq(Row(Row(18, "John Doe")), Row(Row(19, "Mary Doe")))

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
val rowTag = schema.fields.head.name
df.coalesce(1).select(s"$rowTag.*").write.mode("Overwrite").option("rowTag", rowTag).xml("/tmp/xml_test")

If the generated XML file above read again, it will have a flattened schema with two fields ('age' and 'name') instead of a single struct column.

Hi. In this case rootTag is required also. Otherwise it will be the default "ROWS".

I have attributes at root level (in bold) before rows

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root x = 1>
 <rat1>434343</rat1>
 <rat2>
 <x>4</x>
 <y>6</y>
 </rat2>
 <rows>
  <row>
   <a>5</a>
   <b>5</b>
  </row>
  <row>
   <a>5</a>
   <b>5</b>
  </row>
</rows>
</root>

The best would be bypassing rootTag and rowTag as my dataframe has the full nested structure. The behaviour should be same as json libraries

 

 

 

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!