Databricks XML - Bypassing rootTag and rowTag

RobsonNLPT — Fri, 09 Feb 2024 12:05:59 GMT

I think the current DataFrame-to-XML conversion needs to be improved.

My DataFrame has a fully nested schema built from structs, but when I write it out as XML I run into the following issues:

1) I can't add elements to root

2) rootTag and rowTag are required

In the end I remove the first level of the hierarchy (the rowTag element), either with string methods or manually. The row element is already part of the DataFrame's nested schema, so being forced to add another one doesn't make sense.
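For readers wondering what the string-based cleanup mentioned above might look like, here is a minimal sketch. The helper name `removeWrapper` is hypothetical and not part of any library. It assumes the wrapper tag has no attributes, occurs once as the unwanted level, and that the written XML is small enough to hold in memory as a single string.

```scala
// Hypothetical helper (not a spark-xml API): removes the first "<tag>" and the
// last "</tag>" from an XML string, keeping everything in between.
// Assumption: the wrapper tag is written without attributes.
def removeWrapper(xml: String, tag: String): String = {
  val open = s"<$tag>"
  val close = s"</$tag>"
  val start = xml.indexOf(open)
  val end = xml.lastIndexOf(close)
  // Return the input unchanged when the wrapper is not present.
  if (start < 0 || end < start) xml
  else
    xml.substring(0, start) +
      xml.substring(start + open.length, end) +
      xml.substring(end + close.length)
}

// Example input: a document with one unwanted level of nesting.
val wrapped = "<root><row><a>5</a></row></root>"
val unwrapped = removeWrapper(wrapped, "row")
```

This is plain string surgery, so it does not validate the XML; an XML parser would be the safer choice for anything beyond a quick cleanup.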

Re: Databricks XML - Bypassing rootTag and rowTag

RobsonNLPT — Fri, 09 Feb 2024 12:16:19 GMT

Hi Kaniz. I will test your suggestions, but I think the documentation provided by Databricks / Spark should cover these topics in depth. I've seen lots of posts on the web about this.

Thank you

Re: Databricks XML - Bypassing rootTag and rowTag

RobsonNLPT — Fri, 09 Feb 2024 12:36:25 GMT

Hi Kaniz. I tested option("rowTag", "") with the library com.databricks:spark-xml_2.12:0.17.0 and also with the ADB native format (runtime 14.3), but in both cases I got the error "requirement failed: 'rowTag' option should not be empty string".

Re: Databricks XML - Bypassing rootTag and rowTag

sandip_a — Sat, 10 Feb 2024 06:15:09 GMT

Here is one of the ways to use the struct field name as rowTag:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = new StructType().add("Record", new StructType().add("age", IntegerType).add("name", StringType))
val data = Seq(Row(Row(18, "John Doe")), Row(Row(19, "Mary Doe")))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

// Use the name of the top-level struct field as the rowTag
val rowTag = schema.fields.head.name

// Select the struct's children so that each row is written as one <Record> element
df.coalesce(1).select(s"$rowTag.*").write.mode("Overwrite").option("rowTag", rowTag).xml("/tmp/xml_test")

If the XML file generated above is read back, it will have a flattened schema with two fields ('age' and 'name') instead of a single struct column.
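The flattening on read could be checked, and undone, roughly as follows. This is a sketch under assumptions: it expects the files written to /tmp/xml_test by the snippet above, an XML data source available under the short name "xml" (the native reader or the spark-xml library attached to the cluster), and it introduces the names `readBack` and `nested` purely for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

// In a Databricks notebook a session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().getOrCreate()

// Read the files back, treating each <Record> element as one row.
val readBack = spark.read
  .format("xml")
  .option("rowTag", "Record")
  .load("/tmp/xml_test")

// The children of <Record> come back as top-level columns.
readBack.printSchema()

// Re-wrap the columns in a struct to recover the original nested shape.
val nested = readBack.select(struct(col("age"), col("name")).as("Record"))
nested.printSchema()
```

Column types are inferred on read, so they may not match the types declared in the original schema unless a schema is passed explicitly.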

Re: Databricks XML - Bypassing rootTag and rowTag

RobsonNLPT — Sat, 10 Feb 2024 17:35:43 GMT

Hi. In this case rootTag is also required; otherwise it defaults to "ROWS".

I have attributes at the root level (in bold) before the rows:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root x="1">
  <rat1>434343</rat1>
  <rat2>
    <x>4</x>
    <y>6</y>
  </rat2>
  <rows>
    <row>
      <a>5</a>
      <b>5</b>
    </row>
    <row>
      <a>5</a>
      <b>5</b>
    </row>
  </rows>
</root>

Ideally I could bypass rootTag and rowTag entirely, since my DataFrame already carries the full nested structure. The behaviour should be the same as with the JSON libraries.
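To make the limitation concrete, here is a sketch of how the document above could be modelled as a single-row DataFrame and written with rowTag set to "root". It is a workaround under assumptions, not a way to bypass the options: the output path /tmp/xml_doc and the names `docType`, `rowType` and `doc` are made up for illustration, the column types are guesses from the sample values, and it relies on the default attributePrefix "_" to turn the `_x` column into the `x` attribute.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// In a Databricks notebook a session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().getOrCreate()

// Schema of one <row> element.
val rowType = new StructType().add("a", IntegerType).add("b", IntegerType)

// Schema of the whole <root> element; "_x" becomes the attribute x
// (assuming the default attributePrefix "_").
val docType = new StructType()
  .add("_x", IntegerType)
  .add("rat1", LongType)
  .add("rat2", new StructType().add("x", IntegerType).add("y", IntegerType))
  .add("rows", new StructType().add("row", ArrayType(rowType)))

// A single DataFrame row holding the full document.
val doc = Row(1, 434343L, Row(4, 6), Row(Seq(Row(5, 5), Row(5, 5))))
val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(doc)), docType)

// rowTag carries the real root element; rootTag is still emitted around it.
df.coalesce(1).write
  .mode("overwrite")
  .format("xml")
  .option("rootTag", "ROWS")
  .option("rowTag", "root")
  .save("/tmp/xml_doc")
```

Written this way, the output should still be wrapped in the rootTag element (ROWS here), so the outer wrapper would have to be removed afterwards, which is exactly the extra step this thread is asking to avoid.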