Databricks Community

RobsonNLPT · ‎02-09-2024

I see the current conversion of dataframe to xml need to be improved.

My dataframe schema is a perfect nested schema based on structs but when I create a xml I have the follow issues:

1) I can't add elements to root

2) rootTag and rowTag are required

In the end I remove the first level of hierarchy (rowTag) using string methods or manually. The rowTag is already part of the dataframe nested schema so it doesn't make any sense

Kaniz_Fatma · ‎02-09-2024

Hi @RobsonNLPT, Converting DataFrames to XML in Databricks can be tricky, especially when dealing with nested schemas and specific XML requirements.

Let’s address your issues:

Adding Elements to Root:

By default, the rootTag is required when writing a DataFrame to XML. However, if you want to add custom elements to the root, you can create a new DataFrame with the desired structure and then merge it with your original DataFrame.
Here’s an example using Scala:// Suppose your original DataFrame is 'df' val customRootDF = Seq(("customElement1", "value1"), ("customElement2", "value2")) .toDF("elementName", "elementValue") val mergedDF = df.union(customRootDF) // Write mergedDF to XML mergedDF.write .format("com.databricks.spark.xml") .option("rootTag", "root") .option("rowTag", "row") .save("output.xml")
In this example, we create a new DataFrame (customRootDF) with custom elements and then merge it with the original DataFrame (df). The resulting XML will have the custom elements at the root level.

Removing the RowTag:

You’re right that the rowTag is often unnecessary when your DataFrame already has a nested structure. To avoid it, you can specify the desired nested structure directly.
Here’s how you can write your DataFrame without the rowTag:df.write .format("com.databricks.spark.xml") .option("rootTag", "root") .option("rowTag", "") // Set rowTag to an empty string .save("output.xml")
In this case, the resulting XML won’t have the unnecessary rowTag.

Adjust the column names and values according to your DataFrame schema. If you’re using PySpark, the process is similar—replace Scala syntax with Python.

Feel free to adapt these examples to your specific use case, and let me know if you need further assistance! 😊

RobsonNLPT · ‎02-09-2024

Hi Kaniz. Willl test your suggestions but I think the documentation provided by Databricks / Spark should include those relevant topics in depth. I've seen lots of posts on web regarding this topic.

Thank you

RobsonNLPT · ‎02-09-2024

Hi Kaniz . I tested option("rowTag", "") using the library com.databricks:spark-xml_2.12:0.17.0 and also adb native format (runtime 14.3) but in both I got the error "requirement failed: 'rowTag' option should not be empty string"..

sandip_a · ‎02-09-2024

Here is one of the ways to use the struct field name as rowTag:

import org.apache.spark.sql.types._
val schema = new StructType().add("Record",
  new StructType().add("age", IntegerType).add("name", StringType))
val data = Seq(Row(Row(18, "John Doe")), Row(Row(19, "Mary Doe")))

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
val rowTag = schema.fields.head.name
df.coalesce(1).select(s"$rowTag.*").write.mode("Overwrite").option("rowTag", rowTag).xml("/tmp/xml_test")

If the generated XML file above read again, it will have a flattened schema with two fields ('age' and 'name') instead of a single struct column.

RobsonNLPT · ‎02-10-2024

Hi. In this case rootTag is required also. Otherwise it will be the default "ROWS".

I have attributes at root level (in bold) before rows

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root x = 1>
<rat1>434343</rat1>
<rat2>
<x>4</x>
<y>6</y>
</rat2>
<rows>
<row>
<a>5</a>
<b>5</b>
</row>
<row>
<a>5</a>
<b>5</b>
</row>
</rows>
</root>

The best would be bypassing rootTag and rowTag as my dataframe has the full nested structure. The behaviour should be same as json libraries

Databricks Community

Databricks XML - Bypassing rootTag and rowTag

🔔 ALERT: Act Now to Protect Your Community Account; Secure Your Details Before It's Too Late!

Databricks Learning Festival (Virtual): 10 July - 24 July 2024

Data + AI Summit 2024: An Executive Summary for Data Leaders

Big Data Is Back and Is More Important Than AI