Databricks

Ben_Spark · ‎04-14-2022

I'm trying to use the Spark-XML API and I'm running into an issue with the XSD validation option.

When I parse an XML file using the "rowValidationXSDPath" option, the parser doesn't recognize the prefixes/namespaces declared at the root level.

For this to work, I have to move the namespace declarations down to the level of the rowTag.

Example:

<RootTag xmlns:myPrefix1="..." xmlns:myPrefixe2="...">
  <myPrefix1:ParentMember>
    <myPrefixe2:ChildMember>
      ............
    </myPrefixe2:ChildMember>
  </myPrefix1:ParentMember>
</RootTag>

Reading the above structure with the rowValidationXSDPath option fails with the following error: the prefix "myPrefixe2" for element "myPrefixe2:ChildMember" is not bound.
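One plausible reading of this error, sketched here with Python's standard library rather than with Spark-XML itself: if each row is handed to the validator as an isolated fragment, the xmlns declarations on the root element are no longer in scope, so the prefixes inside the row are unbound. Whether Spark-XML extracts rows this way is an assumption. The namespace URIs (urn:example:ns1, urn:example:ns2) and the element text are placeholders, since the post does not give them.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs and content; the original post elides them.
DOC = """<RootTag xmlns:myPrefix1="urn:example:ns1" xmlns:myPrefixe2="urn:example:ns2">
  <myPrefix1:ParentMember>
    <myPrefixe2:ChildMember>value</myPrefixe2:ChildMember>
  </myPrefix1:ParentMember>
</RootTag>"""

# Cut the row out of the document as raw text, leaving the root's
# xmlns attributes behind.
OPEN_TAG = "<myPrefix1:ParentMember>"
CLOSE_TAG = "</myPrefix1:ParentMember>"
start = DOC.index(OPEN_TAG)
end = DOC.index(CLOSE_TAG) + len(CLOSE_TAG)
ROW_FRAGMENT = DOC[start:end]


def parses(xml_text: str) -> bool:
    """Return True if xml_text is well-formed, namespace-aware XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(parses(DOC))           # whole document: prefixes are declared on the root
print(parses(ROW_FRAGMENT))  # isolated row: prefixes have no declaration in scope
```

The whole document parses, while the isolated row does not, which matches the shape of the "prefix is not bound" message above.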

I know this was a bug in previous versions, but I'm wondering whether it was also fixed for the case where the rowValidationXSDPath option is enabled.

Thank you in advance for your help.

Kaniz · ‎04-18-2022

Hi @Ben Ben, this article describes how to read and write an XML file as an Apache Spark data source.

Ben_Spark · ‎04-18-2022

Hi Kaniz

Thank you for your answer.

I'm aware of the article, and reading an XML file without the XSD is not an issue.

The problem is that I need to validate each "row" against an XSD using rowValidationXSDPath, which does not support prefixes at the row level when the namespace is declared at an ancestor level.
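A possible pre-processing workaround, sketched with Python's standard library: re-serialize each row element so that it carries its own xmlns declarations, which automates the manual step of moving the declarations down to the rowTag. This is only a sketch. The namespace URIs and the helper name self_contained_rows are hypothetical, and whether the rewritten rows then pass rowValidationXSDPath has not been tested here.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs; the thread does not give the real ones.
NS1 = "urn:example:ns1"
NS2 = "urn:example:ns2"

DOC = f"""<RootTag xmlns:myPrefix1="{NS1}" xmlns:myPrefixe2="{NS2}">
  <myPrefix1:ParentMember>
    <myPrefixe2:ChildMember>value</myPrefixe2:ChildMember>
  </myPrefix1:ParentMember>
</RootTag>"""


def self_contained_rows(xml_text, row_tag, prefixes):
    """Serialize every row element with its own namespace declarations.

    row_tag is in ElementTree's "{uri}local" form; prefixes maps each
    prefix to its URI so the original prefixes are kept on output.
    """
    for prefix, uri in prefixes.items():
        ET.register_namespace(prefix, uri)  # note: process-wide registry
    root = ET.fromstring(xml_text)
    # ET.tostring declares every namespace used inside the element on the
    # element itself, so each string can be parsed on its own.
    return [ET.tostring(row, encoding="unicode").strip()
            for row in root.iter(row_tag)]


rows = self_contained_rows(
    DOC,
    f"{{{NS1}}}ParentMember",
    {"myPrefix1": NS1, "myPrefixe2": NS2},
)
for row in rows:
    print(row)
```

Each returned string has the declarations on the row element, so it could be written back into a new file under the same root before the file is handed to Spark.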

Kaniz · ‎04-18-2022

Hi @Ben Ben, you can validate individual rows against an XSD schema using rowValidationXSDPath. You can use the utility com.databricks.spark.xml.util.XSDToSchema to extract a Spark DataFrame schema from some XSD files.

It supports only simple, complex, and sequence types and only basic XSD functionality, and it is experimental.
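For reference, a minimal PySpark sketch of how the rowValidationXSDPath option is typically passed when reading. The helper name read_validated_xml and the paths in the usage comment are hypothetical, and the sketch assumes the spark-xml package is attached to the cluster and that the XSD file is visible to the executors.

```python
def read_validated_xml(spark, xml_path, row_tag, xsd_path):
    """Read XML rows with spark-xml, validating each row against an XSD.

    spark is an existing SparkSession. Rows that fail validation are
    handled like rows that fail to parse.
    """
    return (
        spark.read.format("xml")
        .option("rowTag", row_tag)
        .option("rowValidationXSDPath", xsd_path)
        .load(xml_path)
    )


# Usage on a cluster (hypothetical paths):
# df = read_validated_xml(spark, "/mnt/data/input.xml",
#                         "myPrefix1:ParentMember", "/mnt/data/row.xsd")
```

The function only wires the options together, so it does not change the behavior discussed in this thread: prefixes declared on an ancestor of the row element are still reported as unbound.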

If you would like to submit a feature request, please go ahead and share your ideas. We would love to hear them.

Kaniz · ‎04-26-2022

Hi @Ben Ben, would you like to raise a feature request?

Dan_Z · ‎05-04-2022

Hey @Ben Ben, Spark-XML is not a package maintained by Databricks. It seems the community doesn't have any input here. I'd suggest reaching out to the package maintainers via an issue on their GitHub: https://github.com/databricks/spark-xml.

Ben_Spark · ‎05-11-2022

Thank you, Dan, for your feedback and proposal.

For now I will parse the XML file differently. I really have no time to raise a ticket and follow up on it.

Kaniz · ‎05-11-2022

Hi @Ben Ben, just a friendly follow-up. Do you still need help, or did @Dan Zafar's response help you find the solution? Please let us know.

Ben_Spark · ‎05-11-2022

Hi

Sorry for the late response; I got busy looking for a permanent solution to this problem.

In the end, we are giving up on the rowValidationXSDPath option. It does not work when the namespace prefixes are declared at an ancestor level.

Thank you anyway for your help and support.

Kaniz · ‎05-13-2022

Hi @Ben Ben, thank you for providing the solution here.

Databricks Spark XML parser : support for namespace declared at the ancestor level.