08-03-2018 10:35 AM
Both the following commands fail
df1 = sqlContext.read.format("xml").load(loadPath)
df2 = sqlContext.read.format("com.databricks.spark.xml").load(loadPath)
with the following error message:
java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html
I read several articles on this forum, but none had a resolution. I thought Databricks had the XML library installed already. This is on a Databricks cluster running runtime "4.2 (includes Apache Spark 2.3.1, Scala 2.11)".
12-30-2018 12:28 PM
You must add the spark-xml library to your cluster. No, it is not preinstalled in any runtime.
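If you are managing your own SparkSession outside the Databricks Libraries UI, one way to pull the package in is via spark.jars.packages; here is a minimal sketch (the coordinate matches the Scala 2.11 runtime mentioned in the question, and the app name is illustrative):

from pyspark.sql import SparkSession

# Resolve the spark-xml Maven coordinate at session startup.
# com.databricks:spark-xml_2.11:0.5.0 matches Spark 2.3.x / Scala 2.11.
spark = (SparkSession.builder
         .appName("spark-xml-demo")
         .config("spark.jars.packages", "com.databricks:spark-xml_2.11:0.5.0")
         .getOrCreate())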
01-03-2019 09:46 PM
I've installed the spark-xml library using the Databricks Spark Packages interface, and it shows as attached to the cluster, but I get the same error (even after restarting the cluster). Is there something I'm missing when installing the library?
01-04-2019 07:11 AM
Hm, it seems to work for me. I attached com.databricks:spark-xml:0.5.0 to a new runtime 5.1 cluster, and successfully executed a command like the one below. Did the library attach successfully? That should be all there is to it.
display(spark.read.option("rowTag", "book").format("xml").load("/dbfs/tmp/sean.owen/books.xml"))
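The fully qualified source name should behave the same way once the library is attached; a quick sketch with an illustrative path:

# Same read using the long-form data source name; rowTag picks the XML
# element that becomes one DataFrame row. Path and tag are illustrative.
df = (spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load("/dbfs/tmp/books.xml"))
df.printSchema()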
01-04-2019 02:09 PM
That was the issue: the Spark Packages version is 0.1.1, while the Maven Central version is 0.5.0. Changing to the Maven package made the whole thing work.
01-04-2019 02:11 PM
Putting this as a top-level comment. Credit to @srowen for the answer: use the Maven Central library (version 0.5.0) instead of the Spark Packages version (0.1.1).
05-19-2020 08:34 AM
Adding further details to the existing comments: the latest packages can be found on Maven Central.
Example: com.databricks:spark-xml_2.12:0.9.0 is the latest as of today. Here 2.12 is the Scala version the library is built against, so choose the jar that matches your cluster's configuration.
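If you are unsure which Scala version your cluster runs, you can check from a Python notebook through the JVM handle PySpark exposes (an internal hook, so treat this as illustrative):

# Prints something like "version 2.12.10"; pick the spark-xml artifact
# (_2.11 vs _2.12) whose suffix matches.
print(spark.sparkContext._jvm.scala.util.Properties.versionString())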
06-09-2021 04:39 PM
Hi,
If you are getting this error, it is because the com.sun.xml.bind library is now obsolete.
You need to install the org.jvnet.jaxb2.maven package as a library from Maven Central and attach it to the cluster.
Then you will be able to use spark-xml.
For further reference you can check this page: https://datamajor.net/how-to-convert-dataframes-into-xml-files-on-spark/
Please tell me if you have more issues related to this library.
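Since the linked page is about writing XML, here is a minimal write sketch, assuming spark-xml is attached to the cluster; rootTag and rowTag name the enclosing and per-row elements, and the data and output path are illustrative:

# Build a tiny DataFrame and write it out as XML files.
df = spark.createDataFrame([(1, "Spark in Action"), (2, "Learning XML")],
                           ["id", "title"])
(df.write
   .format("xml")
   .option("rootTag", "books")
   .option("rowTag", "book")
   .save("/tmp/books-xml-out"))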