Getting "java.lang.ClassNotFoundException: Failed to find data source: xml" error when loading XML

FrancisLau1897
New Contributor

Both the following commands fail

df1 = sqlContext.read.format("xml").load(loadPath)

df2 = sqlContext.read.format("com.databricks.spark.xml").load(loadPath)

with the following error message:

java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html

I read several articles on this forum, but none had a resolution. I thought Databricks already had the XML library installed. This is on a Databricks cluster running runtime "4.2 (includes Apache Spark 2.3.1, Scala 2.11)".

7 REPLIES

sean_owen
Honored Contributor II

You must add the spark-xml library to your cluster. No, it is not preinstalled in any runtime.
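Outside of the Databricks library UI (for example, on a local Spark installation), one way to attach the package is to pass its Maven coordinate at launch time. A minimal sketch, assuming a Scala 2.11 build of Spark (as in runtime 4.2) and spark-xml 0.5.0; adjust both versions to match your cluster:

```shell
# Hypothetical: have Spark resolve spark-xml from Maven Central at startup.
# The _2.11 suffix must match the Scala version your Spark build uses.
pyspark --packages com.databricks:spark-xml_2.11:0.5.0
```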

msft_Ted
New Contributor II

I've installed the spark-xml library using the Databricks Spark Packages interface, and it shows as attached to the cluster, but I get the same error (even after restarting the cluster). Is there something I'm missing when installing the library?

sean_owen
Honored Contributor II

Hm, it seems to work for me. I attached com.databricks:spark-xml:0.5.0 to a new runtime 5.1 cluster and successfully executed a command like the one below. Did the library attach successfully? That should be all there is to it.

display(spark.read.option("rowTag", "book").format("xml").load("/dbfs/tmp/sean.owen/books.xml"))

msft_Ted
New Contributor II

That was the issue: the Spark Packages version is 0.1.1, while the Maven Central version is 0.5.0. Switching to the Maven package made the whole thing work.

msft_Ted
New Contributor II

Putting this as a top-level comment; credit to @srowen for the answer: use the Maven Central library (version 0.5.0) instead of the Spark Packages version (0.1.1).

VISWANATHANRENG
New Contributor II

Adding further details to the existing comments: the latest packages can be found on Maven Central.

Example: com.databricks:spark-xml_2.12:0.9.0 is the latest as of today. Here, 2.12 is the Scala version the artifact was built for, so choose the jar that matches your cluster's configuration.
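The coordinate above decomposes into three parts (group, artifact with Scala-version suffix, library version). A small sketch of assembling it, with placeholder versions you would read off your own runtime and Maven Central:

```python
# Hypothetical sketch: build the spark-xml Maven coordinate that matches
# your cluster. Both versions below are assumptions for illustration.
scala_version = "2.12"      # Scala version shown in the runtime description
spark_xml_version = "0.9.0"  # latest spark-xml release on Maven Central

coordinate = f"com.databricks:spark-xml_{scala_version}:{spark_xml_version}"
print(coordinate)  # com.databricks:spark-xml_2.12:0.9.0
```

Attaching an artifact built for the wrong Scala version typically fails at class-loading time, so this suffix is worth double-checking.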

alvaroagx
New Contributor II

Hi,

If you are getting this error, it is because the com.sun.xml.bind library is now obsolete.

You need to download the org.jvnet.jaxb2.maven package as a library from Maven Central and attach it to your cluster.

Then you will be able to use spark-xml.

For further reference, you can check this page: https://datamajor.net/how-to-convert-dataframes-into-xml-files-on-spark/

Please let me know if you have more issues with this library.
