2 weeks ago
Hiya! I'm interested in whether anyone has a solution to the following problem. If you load XML using Auto Loader (or otherwise) and set the schema so that a single value is assumed for a given XPath, but the actual XML contains multiple values (i.e. it should be an array), then the additional values aren't rescued and are irrecoverably dropped.
Example XML data: <Message><Attribute>1</Attribute><Attribute>2</Attribute></Message>
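For reference, a minimal repro sketch along these lines (assumes a runtime with native XML support; the column names and literal are just illustrative):

# The schema types Attribute as a single STRING even though the record carries
# two <Attribute> elements.
df = spark.sql("""
    SELECT from_xml(
        '<Message><Attribute>1</Attribute><Attribute>2</Attribute></Message>',
        'Attribute STRING'
    ) AS parsed
""")
df.select("parsed.Attribute").show()
# Only a single value comes back; the other <Attribute> is neither rescued nor
# flagged as corrupt.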
2 weeks ago - last edited 2 weeks ago
Hi @peter_ticker ,
Can you try this SQL snippet (refer to the attachment)?
WITH xml_data AS (
  SELECT '<ROOT><Message><Attribute>1</Attribute><Attribute>2</Attribute></Message><Message><Attribute>3</Attribute></Message><Message><Attribute>5</Attribute><Attribute>6</Attribute></Message></ROOT>' AS xml_string
),
parsed_xml AS (
  SELECT
    xpath(xml_string, 'ROOT/Message/Attribute/text()') AS attributes
  FROM xml_data
)
SELECT
  explode(attributes) AS attribute_value
FROM parsed_xml;
However, this approach may not work for complex and nested schemas.
When you are dealing with a large and complex schema, I personally recommend either converting to JSON or using an external library such as spark-xml from Maven (com.databricks:spark-xml_2.12:0.16.0) with a predefined schema.
Parsing using the Maven library:
I tried different approaches and settled on this as the best fit for my use case (a rough sketch is below). It works even for very complex nested structures.
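Something along these lines is what I mean, assuming the spark-xml package is installed on the cluster (the path is a placeholder):

from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Declaring Attribute as an array up front is what preserves repeated elements.
schema = StructType([
    StructField("Attribute", ArrayType(StringType()), True)
])

df = (spark.read
      .format("com.databricks.spark.xml")   # from com.databricks:spark-xml_2.12:0.16.0
      .option("rowTag", "Message")
      .schema(schema)
      .load("/path/to/xml/files/"))         # placeholder path

df.show(truncate=False)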
Let me know if you need anything else; otherwise, please mark this as the accepted solution.
2 weeks ago
Yes, I understand that XPath allows all the nodes to be selected, but you're assuming I already know there will be multiple Attribute nodes. My use case is streaming with a rather complicated schema, so my concern with the above example is that Databricks doesn't rescue data when the schema is initially misconfigured.
Converting to JSON would have the same problem: if there aren't any files with multiple nodes, the converted JSON wouldn't use an array for the resulting field, which would lead to issues when multiple nodes start appearing.
Does the Maven library rescue data that wasn't captured in the schema, as in the above example?
2 weeks ago
Hi,
If reading XML in Databricks is a concern for you, you can refer to the link below on how to parse XML files in Databricks.
If you find this helpful, please mark it as the solution.
Regards,
Avinash N
2 weeks ago
Your reference provides nothing that would help with my question. With BeautifulSoup you have to know exactly what the schema of the XML is in order to pull it into a Python object. If you misinterpret the XML, the result will be the same as in my example.
2 weeks ago
Since DBR 14.3, XML support is included natively, so there's no real reason to recommend third-party libraries anymore.
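For example, something like this reads XML without any extra library on DBR 14.3+ (the path is a placeholder):

# Native XML reader; schema inference should pick up repeated elements as arrays
# when they are present in the sampled files.
df = (spark.read
      .format("xml")
      .option("rowTag", "Message")
      .load("/path/to/xml/files/"))
df.printSchema()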
2 weeks ago
Hiya! Yep, I'm intending to use the Databricks native XML reader, but my question raises an issue with the rescuedDataColumn that prevents me personally from being able to use it.
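For context, the kind of setup I have in mind looks roughly like this (paths are placeholders, and the single-value schema stands in for the misconfiguration in my example):

from pyspark.sql.types import StringType, StructField, StructType

# The deliberately "wrong" single-value schema from my example.
schema = StructType([StructField("Attribute", StringType(), True)])

stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "xml")
          .option("rowTag", "Message")
          .option("rescuedDataColumn", "_rescued_data")
          .schema(schema)
          .load("/path/to/landing/"))        # placeholder path
# My concern is that the extra <Attribute> values never show up in _rescued_data.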
2 weeks ago
@peter_ticker In your screenshot, I don't see you using Auto Loader. This might be the reason why the rescued data is empty for you.
Besides, you can also play around with the mode and columnNameOfCorruptRecord options.
2 weeks ago
I used from_xml as it's the easiest way to provide a reproducible example. The behaviour is the same with Auto Loader or otherwise. Databricks accepts the schema I provided in the example as valid, so it doesn't consider the record corrupt or produce any rescued data. This is my concern.
2 weeks ago
Have you already checked columnNameOfCorruptRecord? I've seen cases in which this column contained data instead of the rescued data column. I believe the reason is that the XML parser fails before Auto Loader kicks in.
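If it helps, roughly how I'd check it (a batch read for brevity; the corrupt-record column has to be declared in the schema you pass, and the path is a placeholder):

from pyspark.sql.types import StringType, StructField, StructType

schema = StructType([
    StructField("Attribute", StringType(), True),
    StructField("_corrupt_record", StringType(), True),   # must be declared to be surfaced
])

df = (spark.read
      .format("xml")
      .option("rowTag", "Message")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .load("/path/to/xml/files/"))
df.select("_corrupt_record").show(truncate=False)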
2 weeks ago
2 weeks ago
This column will not appear magically if you don't add it to your target table 😉
As an alternative, instead of looking at this table, you can also look directly at the source, e.g. by replacing toTable with display (a rough sketch is below).
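Everything here is a placeholder for whatever read you already have; the point is just to inspect the stream before it is written out:

src = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "xml")
       .option("cloudFiles.schemaLocation", "/path/to/schema/")   # placeholder
       .option("rowTag", "Message")
       .load("/path/to/landing/"))                                # placeholder

display(src)   # look at the source directly, including any rescued/corrupt columns
# instead of writing it out first, e.g.:
# src.writeStream.option("checkpointLocation", "/path/to/checkpoints/").toTable("target_table")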
2 weeks ago
2 weeks ago
Ah, now I see what you did.
`rowTag` is not the root element; it's the element that describes your rows within the root element. In your case, Message is the root element and the Attribute elements are the rows. Meaning: set rowTag to Attribute. And since Attribute itself doesn't contain any sub-elements, you also need to set valueTag to Attribute (a rough sketch is below).
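Sketch of what I mean (the path is a placeholder):

# Each <Attribute> element becomes a row; valueTag names the column that holds
# its character data.
df = (spark.read
      .format("xml")
      .option("rowTag", "Attribute")
      .option("valueTag", "Attribute")
      .load("/path/to/xml/files/"))
df.show()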
2 weeks ago
No, I intended Message to be the element that describes rows. The example I gave is one row of data containing a simple array field called Attribute.