08-03-2018 10:35 AM
Both the following commands fail
df1 = sqlContext.read.format("xml").load(loadPath)
df2 = sqlContext.read.format("com.databricks.spark.xml").load(loadPath)
with the following error message:
java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html
I read several articles on this forum, but none had a resolution. I thought Databricks had the XML library installed already. This is on a Databricks cluster running runtime "4.2 (includes Apache Spark 2.3.1, Scala 2.11)".
12-30-2018 12:28 PM
You must add the spark-xml library to your cluster. No, it is not preinstalled in any runtime.
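If you are managing your own SparkSession outside the Databricks Libraries UI, one way to pull the package in is via spark.jars.packages; here is a minimal sketch (the coordinate matches the Scala 2.11 runtime mentioned in the question, and the app name is illustrative):

from pyspark.sql import SparkSession

# Resolve the spark-xml Maven coordinate at session startup.
# com.databricks:spark-xml_2.11:0.5.0 matches Spark 2.3.x / Scala 2.11.
spark = (SparkSession.builder
         .appName("spark-xml-demo")
         .config("spark.jars.packages", "com.databricks:spark-xml_2.11:0.5.0")
         .getOrCreate())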
01-03-2019 09:46 PM
I've installed the spark-xml library using the Databricks Spark Packages interface, and it shows as attached to the cluster, but I get the same error (even after restarting the cluster). Is there something I'm missing when installing the library?
01-04-2019 07:11 AM
Hm, it seems to work for me. I attached com.databricks:spark-xml:0.5.0 to a new runtime 5.1 cluster, and successfully executed a command like the one below. Did the library attach successfully? That should be all there is to it.
display(spark.read.option("rowTag", "book").format("xml").load("/dbfs/tmp/sean.owen/books.xml"))
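The fully qualified source name should behave the same way once the library is attached; a quick sketch with an illustrative path:

# Same read using the long-form data source name; rowTag picks the XML
# element that becomes one DataFrame row. Path and tag are illustrative.
df = (spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load("/dbfs/tmp/books.xml"))
df.printSchema()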
01-04-2019 02:09 PM
That was the issue: the Spark Packages version is 0.1.1, while the Maven Central version is 0.5.0. Changing to the Maven package made the whole thing work.
01-04-2019 02:11 PM
Putting this as a top-level comment. Credit to @srowen for the answer: use the Maven Central library (version 0.5.0) instead of the Spark Packages version (0.1.1).
05-19-2020 08:34 AM
Adding further details to the existing comments: the latest packages can be found on Maven Central.
Example: com.databricks:spark-xml_2.12:0.9.0 is the latest as of today. Here 2.12 is the Scala version the library is built against, so choose the jar that matches your cluster's configuration.
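If you are unsure which Scala version your cluster runs, you can check from a Python notebook through the JVM handle PySpark exposes (an internal hook, so treat this as illustrative):

# Prints something like "version 2.12.10"; pick the spark-xml artifact
# (_2.11 vs _2.12) whose suffix matches.
print(spark.sparkContext._jvm.scala.util.Properties.versionString())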
06-09-2021 04:39 PM
Hi,
If you are getting this error, it is because the com.sun.xml.bind library is now obsolete.
You need to install the org.jvnet.jaxb2.maven package as a library from Maven Central and attach it to the cluster.
Then you will be able to use spark-xml.
For further reference you can check this page: https://datamajor.net/how-to-convert-dataframes-into-xml-files-on-spark/
Please tell me if you have more issues related to this library.
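Since the linked page is about writing XML, here is a minimal write sketch, assuming spark-xml is attached to the cluster; rootTag and rowTag name the enclosing and per-row elements, and the data and output path are illustrative:

# Build a tiny DataFrame and write it out as XML files.
df = spark.createDataFrame([(1, "Spark in Action"), (2, "Learning XML")],
                           ["id", "title"])
(df.write
   .format("xml")
   .option("rootTag", "books")
   .option("rowTag", "book")
   .save("/tmp/books-xml-out"))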