Databricks XML - Bypassing rootTag and rowTag
02-09-2024 04:04 AM - edited 02-09-2024 04:05 AM
The current DataFrame-to-XML conversion needs to be improved.
My DataFrame schema is a fully nested schema based on structs, but when I write it as XML I run into the following issues:
1) I can't add elements to the root
2) rootTag and rowTag are required
In the end I remove the first level of the hierarchy (rowTag) using string methods or manually. The rowTag is already part of the DataFrame's nested schema, so requiring it doesn't make sense.
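For illustration, the string-based workaround mentioned above could look roughly like the sketch below. `StripRowTag` and `strip` are hypothetical names, not part of Spark or spark-xml, and the regex assumes the row tag carries no attributes and that no nested element reuses its name.

```scala
import java.util.regex.Pattern

object StripRowTag {
  // Deletes every opening and closing tag named `rowTag` (plus the whitespace
  // in front of it) and keeps whatever was between the tags.
  // Assumption: the tag has no attributes and the name is not reused deeper down.
  def strip(xml: String, rowTag: String): String = {
    val tagPattern = "\\s*</?" + Pattern.quote(rowTag) + ">"
    xml.replaceAll(tagPattern, "")
  }
}
```

This is plain text surgery, so it only suits output small enough to hold in memory as a string; anything with attributes on the row element would need a real XML parser instead.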
02-09-2024 04:16 AM
Hi Kaniz. I will test your suggestions, but I think the documentation provided by Databricks / Spark should cover these topics in depth. I've seen lots of posts on the web about this.
Thank you
02-09-2024 04:36 AM
Hi Kaniz. I tested option("rowTag", "") using the library com.databricks:spark-xml_2.12:0.17.0 and also the ADB native format (runtime 14.3), but in both cases I got the error "requirement failed: 'rowTag' option should not be empty string".
02-09-2024 10:15 PM
Here is one of the ways to use the struct field name as rowTag:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// One top-level struct column, "Record", holding the actual fields
val schema = new StructType().add("Record",
  new StructType().add("age", IntegerType).add("name", StringType))
val data = Seq(Row(Row(18, "John Doe")), Row(Row(19, "Mary Doe")))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

// Use the struct's own name as rowTag and write only its child fields
val rowTag = schema.fields.head.name
df.coalesce(1).select(s"$rowTag.*")
  .write.mode("Overwrite").option("rowTag", rowTag).xml("/tmp/xml_test")
If the XML file generated above is read back, it will have a flattened schema with two fields ('age' and 'name') instead of a single struct column.
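If the single struct column is wanted again after reading, the flattened columns can be packed back into a struct. The following is a minimal sketch; `Renest` and `renest` are hypothetical helper names, and it assumes the column names contain no dots or other characters that `col` would interpret.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct}

object Renest {
  // Packs every top-level column of `flat` into one struct column called `name`.
  def renest(flat: DataFrame, name: String): DataFrame =
    flat.select(struct(flat.columns.map(col): _*).as(name))
}

// Usage (assumes a SparkSession `spark` with XML support, as in the reply above):
//   val flat = spark.read.format("xml").option("rowTag", "Record").load("/tmp/xml_test")
//   val nested = Renest.renest(flat, "Record")
```

The field types inside the rebuilt struct are whatever the reader inferred, so they may differ from the original schema unless a schema is supplied on read.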
02-10-2024 09:35 AM
Hi. In this case rootTag is also required; otherwise it defaults to "ROWS".
I have content at root level (the x attribute, rat1 and rat2) before the rows:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root x="1">
  <rat1>434343</rat1>
  <rat2>
    <x>4</x>
    <y>6</y>
  </rat2>
  <rows>
    <row>
      <a>5</a>
      <b>5</b>
    </row>
    <row>
      <a>5</a>
      <b>5</b>
    </row>
  </rows>
</root>
The best solution would be to bypass rootTag and rowTag entirely, since my DataFrame already has the full nested structure. The behaviour should be the same as with the JSON libraries.
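Until something like that exists, one possible workaround is to let the writer produce only the rows part and assemble the outer document by hand. The sketch below is plain Scala with no Spark dependency; `RootAssembler`, `wrap` and `escapeAttr` are hypothetical names, and the `header` and `rows` fragments are assumed to be well-formed XML strings produced elsewhere, without an XML declaration of their own.

```scala
object RootAssembler {
  // Escapes the characters that are unsafe inside a double-quoted attribute value.
  def escapeAttr(value: String): String =
    value.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;")

  // Builds: declaration, <rootTag attrs>, header fragment, rows fragment, </rootTag>.
  // `header` and `rows` are inserted verbatim (assumed to be valid XML already).
  def wrap(rootTag: String,
           attrs: Seq[(String, String)],
           header: String,
           rows: String): String = {
    val attrText = attrs
      .map { case (k, v) => " " + k + "=\"" + escapeAttr(v) + "\"" }
      .mkString
    Seq(
      "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>",
      "<" + rootTag + attrText + ">",
      header,
      rows,
      "</" + rootTag + ">"
    ).mkString("\n")
  }
}

// Usage with the shape from the post above:
//   RootAssembler.wrap("root", Seq("x" -> "1"),
//     "<rat1>434343</rat1>", "<rows><row><a>5</a><b>5</b></row></rows>")
```

For the root attribute alone, the spark-xml README describes passing basic attributes inside the rootTag value (for example rootTag set to `root x="1"`); whether the native format in runtime 14.3 behaves the same is an assumption here, and it would not cover the sibling elements rat1 and rat2.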

