02-09-2024 04:04 AM - edited 02-09-2024 04:05 AM
I think the current conversion of a DataFrame to XML needs to be improved.
My DataFrame schema is a fully nested schema based on structs, but when I create the XML I run into the following issues:
1) I can't add elements to the root
2) rootTag and rowTag are required
In the end I remove the first level of the hierarchy (rowTag) using string methods or manually. The rowTag is already part of the DataFrame's nested schema, so requiring it doesn't make sense.
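For reference, the string-based workaround mentioned above could look roughly like the sketch below. The object and method names (XmlPostProcess, dropLevel) are made up for illustration, and the regex assumes the wrapper tag is never self-closing and that no attribute value contains ">". It is a fragile post-processing step, not a library feature.

```scala
// Hypothetical helper (not part of spark-xml): removes every opening and
// closing tag named `tag` from an XML string while keeping its children.
object XmlPostProcess {
  def dropLevel(xml: String, tag: String): String = {
    // Quote the tag name so it is matched literally inside the regex.
    val q = java.util.regex.Pattern.quote(tag)
    // Matches <tag>, <tag attr="...">, and </tag>, plus trailing whitespace.
    // The name must be followed by whitespace or ">", so <ROWS> is not hit
    // when the tag is ROW.
    xml.replaceAll(s"</?$q(\\s[^>]*)?>\\s*", "")
  }
}

val written = "<ROWS><ROW><Record><age>18</age></Record></ROW></ROWS>"
val cleaned = XmlPostProcess.dropLevel(written, "ROW")
println(cleaned)
```

On large outputs this means reading the whole file into a string on the driver, which is part of why it feels like a workaround.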
02-09-2024 04:11 AM
Hi @RobsonNLPT, Converting DataFrames to XML in Databricks can be tricky, especially when dealing with nested schemas and specific XML requirements.
Let's address your issues:
Adding Elements to Root:
Removing the RowTag:
Adjust the column names and values according to your DataFrame schema. If you're using PySpark, the process is similar: replace the Scala syntax with Python.
Feel free to adapt these examples to your specific use case, and let me know if you need further assistance!
02-09-2024 04:16 AM
Hi Kaniz. I will test your suggestions, but I think the documentation provided by Databricks / Spark should cover these topics in depth. I've seen lots of posts on the web about this.
Thank you
02-09-2024 04:36 AM
Hi Kaniz. I tested option("rowTag", "") using the library com.databricks:spark-xml_2.12:0.17.0 and also the native Databricks format (runtime 14.3), but in both cases I got the error "requirement failed: 'rowTag' option should not be empty string".
02-09-2024 10:15 PM
Here is one way to use the struct field name as the rowTag:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// Schema with a single top-level struct column named "Record"
val schema = new StructType().add("Record",
new StructType().add("age", IntegerType).add("name", StringType))
val data = Seq(Row(Row(18, "John Doe")), Row(Row(19, "Mary Doe")))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
// Reuse the struct field name as the rowTag and write only the struct's fields
val rowTag = schema.fields.head.name
df.coalesce(1).select(s"$rowTag.*").write.mode("overwrite").option("rowTag", rowTag).xml("/tmp/xml_test")
If the XML file generated above is read back, it will have a flattened schema with two fields ('age' and 'name') instead of a single struct column.
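To see the flattening on read, and to restore the original shape, something like the following sketch should work. It assumes the write above has completed, that `spark` is the active SparkSession (as in a Databricks notebook), and that the native XML reader is available; with the spark-xml library you would also need `import com.databricks.spark.xml._` for `.xml(...)`. The names readBack and nested are only illustrative.

```scala
import org.apache.spark.sql.functions.{col, struct}

// Read the file back using the same rowTag that was used for writing.
val readBack = spark.read.option("rowTag", "Record").xml("/tmp/xml_test")

// The top-level columns are now the struct's fields, not a "Record" struct.
readBack.printSchema()

// Wrap the flattened columns back into a single struct column named "Record".
val nested = readBack.select(struct(col("age"), col("name")).as("Record"))
nested.printSchema()
```

The field types are inferred on read, so they may differ from the types in the schema used for writing unless you pass an explicit schema to the reader.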
02-10-2024 09:35 AM
Hi. In this case rootTag is also required; otherwise it defaults to "ROWS".
I have an attribute and elements at root level (x, rat1 and rat2 in the example below) before the rows:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root x="1">
<rat1>434343</rat1>
<rat2>
<x>4</x>
<y>6</y>
</rat2>
<rows>
<row>
<a>5</a>
<b>5</b>
</row>
<row>
<a>5</a>
<b>5</b>
</row>
</rows>
</root>
Ideally I could bypass rootTag and rowTag, since my DataFrame already has the full nested structure. The behaviour should be the same as with the JSON libraries.
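To make the limitation concrete, here is a rough sketch of how the document above could be modelled as a single row, assuming `spark` is the active SparkSession and the native `.xml(...)` writer is available. The names docSchema, docDf and the path /tmp/xml_doc are made up for illustration. It relies on two writer behaviours: fields whose names start with the default attributePrefix "_" are written as attributes, and an array field is written as repeated elements.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rowType  = new StructType().add("a", IntegerType).add("b", IntegerType)
val rat2Type = new StructType().add("x", IntegerType).add("y", IntegerType)

// One row holds the whole document; "_x" should become the attribute x.
val docSchema = new StructType()
  .add("_x", IntegerType)
  .add("rat1", IntegerType)
  .add("rat2", rat2Type)
  .add("rows", new StructType().add("row", ArrayType(rowType)))

val doc = Row(1, 434343, Row(4, 6), Row(Seq(Row(5, 5), Row(5, 5))))
val docDf = spark.createDataFrame(spark.sparkContext.parallelize(Seq(doc)), docSchema)

// rowTag is "root", so the single row is written as the <root> element.
docDf.coalesce(1).write.mode("overwrite")
  .option("rowTag", "root")
  .xml("/tmp/xml_doc")
```

Even with this layout the writer still wraps the output in the rootTag element (ROWS by default), so the extra outer level has to be removed afterwards, which is exactly the step I would like to avoid.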