2 weeks ago
Hiya! I'm interested in whether anyone has a solution to the following problem. If you load XML using Auto Loader (or otherwise) and set the schema so that a single value is assumed for a given XPath, but the actual XML contains multiple values (i.e. it should be an array), then the additional values aren't rescued and are irrecoverably dropped.
Example XML data: <Message><Attribute>1</Attribute><Attribute>2</Attribute></Message>
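For reference, a minimal repro sketch along these lines (assumes a runtime with native XML support; the column names and literal are just illustrative):

# The schema types Attribute as a single STRING even though the record carries
# two <Attribute> elements.
df = spark.sql("""
    SELECT from_xml(
        '<Message><Attribute>1</Attribute><Attribute>2</Attribute></Message>',
        'Attribute STRING'
    ) AS parsed
""")
df.select("parsed.Attribute").show()
# Only a single value comes back; the other <Attribute> is neither rescued nor
# flagged as corrupt.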
2 weeks ago - last edited 2 weeks ago
Hi @peter_ticker ,
Can you try this SQL snippet (refer to the attachment)?
WITH xml_data AS (
  SELECT '<ROOT><Message><Attribute>1</Attribute><Attribute>2</Attribute></Message><Message><Attribute>3</Attribute></Message><Message><Attribute>5</Attribute><Attribute>6</Attribute></Message></ROOT>' AS xml_string
),
parsed_xml AS (
  SELECT
    xpath(xml_string, 'ROOT/Message/Attribute/text()') AS attributes
  FROM xml_data
)
SELECT
  explode(attributes) AS attribute_value
FROM parsed_xml;
However, this approach may not work for complex and nested schemas.
When you are dealing with a large and complex schema, I personally recommend either converting to JSON or using an external library such as spark-xml from Maven (com.databricks:spark-xml_2.12:0.16.0) with a predefined schema.
Parsing using the Maven library:
I tried different approaches and settled on this as the best fit for my use case (a rough sketch is below). It works even for very complex nested structures.
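Something along these lines is what I mean, assuming the spark-xml package is installed on the cluster (the path is a placeholder):

from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Declaring Attribute as an array up front is what preserves repeated elements.
schema = StructType([
    StructField("Attribute", ArrayType(StringType()), True)
])

df = (spark.read
      .format("com.databricks.spark.xml")   # from com.databricks:spark-xml_2.12:0.16.0
      .option("rowTag", "Message")
      .schema(schema)
      .load("/path/to/xml/files/"))         # placeholder path

df.show(truncate=False)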
Let me know if you need anything else; otherwise, please mark this as the accepted solution.
2 weeks ago
Yes, I understand that XPath allows all the nodes to be selected, but you're assuming I already know there will be multiple Attribute nodes. My use case is streaming with a rather complicated schema, so my concern with the above example is that Databricks doesn't rescue data when the schema is initially misconfigured.
Converting to JSON would have the same problem: if there aren't any files with multiple nodes, the converted JSON wouldn't use an array for the resulting field, which would lead to issues when multiple nodes start appearing.
Does the Maven library rescue data that wasn't captured in the schema, as in the above example?
2 weeks ago
Hi,
If reading XML in Databricks is a concern for you, you can refer to the link below on how to parse XML files in Databricks.
If you find this helpful, please mark it as the solution.
Regards,
Avinash N
2 weeks ago
Your reference provides nothing that would help with my question. With BeautifulSoup you have to know exactly what the schema of the XML is in order to pull it into a Python object. If you misinterpret the XML, the result will be the same as in my example.
2 weeks ago
Since DBR 14.3, XML support is included natively, so there's no real reason to recommend third-party libraries anymore.
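For example, something like this reads XML without any extra library on DBR 14.3+ (the path is a placeholder):

# Native XML reader; schema inference should pick up repeated elements as arrays
# when they are present in the sampled files.
df = (spark.read
      .format("xml")
      .option("rowTag", "Message")
      .load("/path/to/xml/files/"))
df.printSchema()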
2 weeks ago
Hiya! Yep, I'm intending to use the Databricks native XML reader, but my question raises an issue with the rescuedDataColumn that prevents me personally from being able to use it.
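For context, the kind of setup I have in mind looks roughly like this (paths are placeholders, and the single-value schema stands in for the misconfiguration in my example):

from pyspark.sql.types import StringType, StructField, StructType

# The deliberately "wrong" single-value schema from my example.
schema = StructType([StructField("Attribute", StringType(), True)])

stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "xml")
          .option("rowTag", "Message")
          .option("rescuedDataColumn", "_rescued_data")
          .schema(schema)
          .load("/path/to/landing/"))        # placeholder path
# My concern is that the extra <Attribute> values never show up in _rescued_data.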
2 weeks ago
@peter_ticker In your screenshot, I don't see you using Auto Loader. This might be the reason why the rescued data is empty for you.
Besides, you can also play around with the mode and columnNameOfCorruptRecord options.
2 weeks ago
I used from_xml as it's the easiest way to provide a reproducible example. The behaviour is the same with Auto Loader or otherwise. Databricks accepts the schema I provided in the example as valid, so it doesn't consider the record corrupt or produce any rescued data. This is my concern.
2 weeks ago
Have you already checked columnNameOfCorruptRecord? I've seen cases in which this column contained data instead of the rescued data column. I believe the reason is that the XML parser fails before Auto Loader kicks in.
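If it helps, roughly how I'd check it (a batch read for brevity; the corrupt-record column has to be declared in the schema you pass, and the path is a placeholder):

from pyspark.sql.types import StringType, StructField, StructType

schema = StructType([
    StructField("Attribute", StringType(), True),
    StructField("_corrupt_record", StringType(), True),   # must be declared to be surfaced
])

df = (spark.read
      .format("xml")
      .option("rowTag", "Message")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .load("/path/to/xml/files/"))
df.select("_corrupt_record").show(truncate=False)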
2 weeks ago
2 weeks ago
This column will not appear magically if you don't add it to your target table 😉
As an alternative, instead of looking at this table, you can also look directly at the source, e.g. by replacing toTable with display (a rough sketch is below).
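Everything here is a placeholder for whatever read you already have; the point is just to inspect the stream before it is written out:

src = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "xml")
       .option("cloudFiles.schemaLocation", "/path/to/schema/")   # placeholder
       .option("rowTag", "Message")
       .load("/path/to/landing/"))                                # placeholder

display(src)   # look at the source directly, including any rescued/corrupt columns
# instead of writing it out first, e.g.:
# src.writeStream.option("checkpointLocation", "/path/to/checkpoints/").toTable("target_table")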
2 weeks ago
2 weeks ago
Ah, now I see what you did.
`rowTag` is not the root element; it's the element that describes your rows within the root element. In your case, Message is the root element and the Attribute elements are the rows. Meaning: set rowTag to Attribute. And since Attribute itself doesn't contain any sub-elements, you also need to set valueTag to Attribute (a rough sketch is below).
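Sketch of what I mean (the path is a placeholder):

# Each <Attribute> element becomes a row; valueTag names the column that holds
# its character data.
df = (spark.read
      .format("xml")
      .option("rowTag", "Attribute")
      .option("valueTag", "Attribute")
      .load("/path/to/xml/files/"))
df.show()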
2 weeks ago
No, I intended Message to be the element that describes rows. The example I gave is one row of data containing a simple array field called Attribute.