08-03-2018 10:35 AM
Both the following commands fail
df1 = sqlContext.read.format("xml").load(loadPath)
df2 = sqlContext.read.format("com.databricks.spark.xml").load(loadPath)
with the following error message:
java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html
I have read several articles on this forum, but none had a resolution. I thought Databricks had the XML library installed already. This is on a DBC cluster with runtime "4.2 (includes Apache Spark 2.3.1, Scala 2.11)".
12-30-2018 12:28 PM
You must add the spark-xml library to your cluster. No, it is not preinstalled in any runtime.
01-03-2019 09:46 PM
I've installed the spark-xml library using the Databricks Spark Packages interface, and it shows as attached to the cluster, but I get the same error (even after restarting the cluster). Is there something I'm missing when installing the library?
01-04-2019 07:11 AM
Hm, it seems to work for me. I attached com.databricks:spark-xml:0.5.0 to a new runtime 5.1 cluster and successfully executed a command like the one below. Did the library attach successfully? That should be all there is to it.
display(spark.read.option("rowTag", "book").format("xml").load("/dbfs/tmp/sean.owen/books.xml"))
01-04-2019 02:09 PM
That was the issue: the Spark Packages version is 0.1.1, while the Maven Central version is 0.5.0. Switching to the Maven Central package made the whole thing work.
01-04-2019 02:11 PM
Putting this as a top-level comment. Credit to @srowen for the answer: use the Maven Central library (version 0.5.0) instead of the Spark Packages version (0.1.1).
05-19-2020 08:34 AM
Adding further detail to the existing comments: the latest packages can be found on Maven Central.
Example: com.databricks:spark-xml_2.12:0.9.0 is the latest as of today. Here, 2.12 is the Scala version the library is built against, so choose the jar that matches your cluster's Scala version.
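Outside of the Databricks library UI, the same Maven coordinate can be passed to a stock Spark installation with the `--packages` flag (a sketch; the version shown is the one mentioned above and may not be current):

```shell
# Launch a PySpark shell with spark-xml resolved from Maven Central.
# The _2.12 suffix must match the Scala version of your Spark build
# (e.g. use spark-xml_2.11 for Spark builds on Scala 2.11).
pyspark --packages com.databricks:spark-xml_2.12:0.9.0
```

Once the package is attached this way, `spark.read.format("xml").option("rowTag", "book").load(...)` works the same as in the earlier example.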
06-09-2021 04:39 PM
Hi,
If you are getting this error, it may be because the com.sun.xml.bind library is now obsolete.
You need to install the org.jvnet.jaxb2.maven package as a library from Maven Central and attach it to the cluster.
Then you will be able to use spark-xml.
For further reference, you can check this page: https://datamajor.net/how-to-convert-dataframes-into-xml-files-on-spark/
Please tell me if you have more issues related to this library.