
Getting "java.lang.ClassNotFoundException: Failed to find data source: xml" error when loading XML

FrancisLau1897
New Contributor

Both of the following commands fail:

df1 = sqlContext.read.format("xml").load(loadPath)

df2 = sqlContext.read.format("com.databricks.spark.xml").load(loadPath)

with the following error message:

java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html

I read several articles on this forum but none had a resolution. I thought Databricks has the XML library installed already. This is on a DBC cluster with "4.2 (includes Apache Spark 2.3.1, Scala 2.11)"

7 REPLIES

sean_owen
Honored Contributor II

You must add the spark-xml library to your cluster. No, it is not preinstalled in any runtime.

msft_Ted
New Contributor II

I've installed the spark-xml library using the Databricks Spark Packages interface, and it shows as attached to the cluster, but I get the same error (even after restarting the cluster). Is there something I'm missing for installing the library?

sean_owen
Honored Contributor II

Hm, it seems to work for me. I attached com.databricks:spark-xml:0.5.0 to a new runtime 5.1 cluster, and successfully executed a command like the one below. Did the library attach successfully? That should be all there is to it.

display(spark.read.option("rowTag", "book").format("xml").load("/dbfs/tmp/sean.owen/books.xml"))

msft_Ted
New Contributor II

That was the issue - the Spark Packages version is 0.1.1, the Maven Central version is 0.5.0 - changing to the Maven package made the whole thing work.

msft_Ted
New Contributor II

Putting this as a top-level comment, with credit to @srowen for the answer: use the Maven Central library (version 0.5.0) instead of the Spark Packages version (0.1.1).
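
For anyone who wants the failure to point straight at the missing library, here is a minimal sketch of a wrapper around the read call used earlier in the thread. The function name `load_xml` and the wording of the hint are assumptions made for illustration; the only Spark calls used are `spark.read.format(...).option(...).load(...)`, and the check simply looks for the "Failed to find data source" text from the error quoted in the question.

```python
def load_xml(spark, path, row_tag):
    """Read an XML file through the "xml" data source.

    `spark` is an existing SparkSession (for example the one a notebook
    provides). `load_xml` is a hypothetical helper, not part of spark-xml.
    """
    try:
        return (
            spark.read.format("xml")
            .option("rowTag", row_tag)  # element that marks one row
            .load(path)
        )
    except Exception as exc:
        # The error from the question contains this text when the
        # spark-xml library is not available on the cluster.
        if "Failed to find data source" in str(exc):
            raise RuntimeError(
                "The xml data source was not found. Attach spark-xml from "
                "Maven Central to the cluster and try again."
            ) from exc
        raise
```

In a notebook this would be called as `load_xml(spark, loadPath, "book")`, where `"book"` is the row tag from the example above. Any other error is re-raised unchanged.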

VISWANATHANRENG
New Contributor II

Adding further details to the existing comments: the latest package coordinates can be found on Maven Central.

Example: com.databricks:spark-xml_2.12:0.9.0 is the latest as of today. Here 2.12 is the Scala version the artifact is built for, so choose the jar whose Scala suffix matches the Scala version of your cluster.
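
As a rough illustration of how the Scala suffix ties to the cluster, the sketch below pulls the Scala version out of a runtime description like the one in the original question and builds the matching Maven coordinate. The helper names `scala_binary_version` and `spark_xml_coordinate` are hypothetical, and the code does not check that a given library version is actually published for that Scala version, so confirm that on Maven Central.

```python
import re


def scala_binary_version(runtime_description):
    """Extract the Scala version (e.g. "2.11") from a runtime description."""
    match = re.search(r"Scala (\d+\.\d+)", runtime_description)
    if match is None:
        raise ValueError(f"No Scala version found in: {runtime_description!r}")
    return match.group(1)


def spark_xml_coordinate(runtime_description, library_version):
    """Build a group:artifact_scalaVersion:version coordinate for spark-xml."""
    scala = scala_binary_version(runtime_description)
    return f"com.databricks:spark-xml_{scala}:{library_version}"


# Runtime string copied from the original question.
runtime = "4.2 (includes Apache Spark 2.3.1, Scala 2.11)"
print(spark_xml_coordinate(runtime, "0.5.0"))
# prints com.databricks:spark-xml_2.11:0.5.0
```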

alvaroagx
New Contributor II

Hi,

If you are getting this error, it may be because the com.sun.xml.bind library is now obsolete.

You need to create a library from the org.jvnet.jaxb2.maven package using Maven Central and attach it to the cluster.

After that you should be able to use spark-xml.

For further references you can check this page: https://datamajor.net/how-to-convert-dataframes-into-xml-files-on-spark/

Please let me know if you have more issues with this library.
