Databricks Community

RobsonNLPT · 02-09-2024

I think the current conversion of a dataframe to XML needs to be improved.

My dataframe schema is a fully nested schema based on structs, but when I create an XML file I run into the following issues:

1) I can't add elements to root

2) rootTag and rowTag are required

In the end I remove the first level of the hierarchy (rowTag) using string methods or manually. The rowTag is already part of the dataframe's nested schema, so requiring it separately doesn't make sense.

RobsonNLPT · 02-09-2024

Hi Kaniz. I will test your suggestions, but I think the documentation provided by Databricks / Spark should cover these topics in depth. I've seen lots of posts on the web about this.

Thank you

RobsonNLPT · 02-09-2024

Hi Kaniz. I tested option("rowTag", "") using the library com.databricks:spark-xml_2.12:0.17.0 and also the Azure Databricks native XML format (runtime 14.3), but in both cases I got the error "requirement failed: 'rowTag' option should not be empty string".

sandip_a · 02-09-2024

Here is one of the ways to use the struct field name as rowTag:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Nested schema: a single struct column named "Record"
val schema = new StructType().add("Record",
  new StructType().add("age", IntegerType).add("name", StringType))
val data = Seq(Row(Row(18, "John Doe")), Row(Row(19, "Mary Doe")))

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
// Use the struct field name as rowTag and write the struct's children as the row content
val rowTag = schema.fields.head.name
df.coalesce(1).select(s"$rowTag.*").write.mode("Overwrite").option("rowTag", rowTag).xml("/tmp/xml_test")

If the XML file generated above is read back, it will have a flattened schema with two fields ('age' and 'name') instead of a single struct column.

RobsonNLPT · 02-10-2024

Hi. In this case rootTag is also required; otherwise the root element defaults to "ROWS".

I also have an attribute and elements at the root level, before the rows (x, rat1 and rat2 in the example below):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root x="1">
<rat1>434343</rat1>
<rat2>
<x>4</x>
<y>6</y>
</rat2>
<rows>
<row>
<a>5</a>
<b>5</b>
</row>
<row>
<a>5</a>
<b>5</b>
</row>
</rows>
</root>
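
One possible workaround for a layout like the one above is to post-process the output in plain Scala: take the row fragments as strings and wrap them in a custom root. The sketch below is only an illustration under stated assumptions. RootWrapper, wrap, and all parameter names are hypothetical and not part of Spark or spark-xml, and the row and header fragments are assumed to be already serialized as XML strings (for example, read back from the written part file). Since everything is assembled in memory, it only suits small outputs.

```scala
// Hypothetical helper (not a Spark or spark-xml API): wraps already-serialized
// XML fragments in a custom root element with attributes and header elements.
object RootWrapper {
  // Escapes characters that are not allowed inside an XML attribute value.
  def escape(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace("\"", "&quot;")

  def wrap(rootTag: String,
           rootAttributes: Seq[(String, String)],
           headerXml: Seq[String],   // elements that come before the rows
           rowsTag: String,
           rowXml: Seq[String]): String = {
    // Render attributes as: name="value", each preceded by a space.
    val attrs = rootAttributes.map { case (k, v) => " " + k + "=\"" + escape(v) + "\"" }.mkString
    val lines =
      Seq("<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>", s"<$rootTag$attrs>") ++
        headerXml ++
        Seq(s"<$rowsTag>") ++ rowXml ++ Seq(s"</$rowsTag>", s"</$rootTag>")
    lines.mkString("\n")
  }
}

// Usage with the values from the example above (fragments written by hand here).
val xml = RootWrapper.wrap(
  rootTag = "root",
  rootAttributes = Seq("x" -> "1"),
  headerXml = Seq("<rat1>434343</rat1>", "<rat2><x>4</x><y>6</y></rat2>"),
  rowsTag = "rows",
  rowXml = Seq("<row><a>5</a><b>5</b></row>", "<row><a>5</a><b>5</b></row>")
)
println(xml)
```

The fragments are passed through unchanged, so they must already be well-formed; only the root attribute values are escaped by this sketch.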