Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Getting "java.lang.ClassNotFoundException: Failed to find data source: xml" error when loading XML

FrancisLau1897
New Contributor

Both the following commands fail

df1 = sqlContext.read.format("xml").load(loadPath)

df2 = sqlContext.read.format("com.databricks.spark.xml").load(loadPath)

with the following error message:

java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html

I read several articles on this forum but none had a resolution. I thought Databricks had the XML library installed already. This is on a DBC cluster with "4.2 (includes Apache Spark 2.3.1, Scala 2.11)".

7 REPLIES

sean_owen
Databricks Employee

You must add the spark-xml library to your cluster. No, it is not preinstalled in any runtime.
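For a cluster you manage yourself (outside the Databricks library UI), the same package can also be pulled from Maven at launch with Spark's `--packages` flag. The coordinate below is illustrative; the `_2.x` suffix must match your cluster's Scala version:

```
pyspark --packages com.databricks:spark-xml_2.11:0.5.0
```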

msft_Ted
New Contributor II

I've installed the spark-xml library using the Databricks Spark Packages interface and it shows as attached to the cluster, but I get the same error (even after restarting the cluster). Is there something I'm missing when installing the library?

sean_owen
Databricks Employee

Hm, it seems to work for me. I attached com.databricks:spark-xml:0.5.0 to a new runtime 5.1 cluster and successfully executed a command like the one below. Did the library attach successfully? That should be all there is to it.

display(spark.read.option("rowTag", "book").format("xml").load("/dbfs/tmp/sean.owen/books.xml"))
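For anyone reproducing this, a minimal input file works fine. The content below is purely illustrative (the books.xml above is the poster's own file); with rowTag set to "book", each `<book>` element becomes one DataFrame row:

```python
import pathlib

# Illustrative sample XML: any file with repeated <book> elements will do.
sample = """<?xml version="1.0"?>
<catalog>
  <book id="bk101"><title>Spark Basics</title></book>
  <book id="bk102"><title>Working with XML</title></book>
</catalog>
"""

# Write it somewhere readable by the cluster; the path here is illustrative.
path = pathlib.Path("/tmp/books.xml")
path.write_text(sample)
```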

msft_Ted
New Contributor II

That was the issue - the Spark Packages version is 0.1.1, while the Maven Central version is 0.5.0. Switching to the Maven package made the whole thing work.

msft_Ted
New Contributor II

Putting this as a top-level comment; credit to @srowen for the answer: use the Maven Central library (version 0.5.0) instead of the Spark Packages version (0.1.1).

VISWANATHANRENG
New Contributor II

Adding further detail to the existing comments: the latest packages can be found on Maven Central.

Example: com.databricks:spark-xml_2.12:0.9.0 is the latest as of today. Here 2.12 is the Scala version the artifact is built for, so we can choose the jar that matches our cluster's configuration.
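To make the coordinate structure concrete, here is a small illustrative helper (not part of spark-xml) that splits a Maven coordinate and pulls out the Scala suffix you need to match against your runtime:

```python
def parse_coordinate(coord):
    """Split a Maven coordinate into group, artifact, Scala suffix, version.

    Illustrative helper only; Maven/Spark do this resolution themselves.
    """
    group, artifact, version = coord.split(":")
    # Scala libraries encode the Scala version after an underscore,
    # e.g. spark-xml_2.12 is built for Scala 2.12.
    name, _, scala = artifact.partition("_")
    return {"group": group, "artifact": name,
            "scala_version": scala, "version": version}

info = parse_coordinate("com.databricks:spark-xml_2.12:0.9.0")
print(info["scala_version"])  # pick the suffix matching your runtime's Scala
```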

alvaroagx
New Contributor II

Hi,

If you are getting this error, it may be because the com.sun.xml.bind library is now obsolete.

You need to download the org.jvnet.jaxb2.maven package as a library from Maven Central and attach it to the cluster.

Then you will be able to use spark-xml.

For further reference, you can check this page: https://datamajor.net/how-to-convert-dataframes-into-xml-files-on-spark/

Please tell me if you have more issues related to this library.
