spark-xml not working with Databricks Connect and Pyspark

brendan-b
New Contributor II

Hi all,

I currently have a cluster configured in Databricks with spark-xml (com.databricks:spark-xml_2.12:0.13.0), which was installed from Maven. The spark-xml library itself works fine with PySpark when I use it in a notebook in the Databricks web app.

I often use Databricks Connect with PySpark for development, though, specifically from VS Code. Databricks Connect itself works fine when I run commands on the cluster such as spark.read.csv.

However, when I try to run my spark-xml code from within VS Code, I receive the following error:

java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html

I have tried both read formats below with no luck. I have also tried placing the spark-xml jar that matches the cluster's version in my local PySpark jars directory, but that did not work either.

df = spark.read.format('xml')
df = spark.read.format('com.databricks.spark.xml')
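To double-check the manual jar placement mentioned above, here is a minimal sketch that lists any spark-xml jars in a given directory and reports whether the expected version is among them. The helper name `find_spark_xml_jars` is hypothetical, and it assumes the jar follows the usual Maven `artifactId-version.jar` file naming (so `spark-xml_2.12-0.13.0.jar` for the coordinates above).

```python
from pathlib import Path


def find_spark_xml_jars(jars_dir, scala_version="2.12", version="0.13.0"):
    """Hypothetical helper: look for spark-xml jars in jars_dir.

    Returns (found, has_expected), where found is a sorted list of
    spark-xml jar file names and has_expected says whether the jar for
    the given Scala/library version is present.
    Assumes Maven-style naming: spark-xml_<scala>-<version>.jar
    """
    expected = f"spark-xml_{scala_version}-{version}.jar"
    found = sorted(p.name for p in Path(jars_dir).glob("spark-xml_*.jar"))
    return found, expected in found
```

In a local venv the directory to pass is probably the `jars` folder next to the installed `pyspark` package (for example `Path(pyspark.__file__).parent / "jars"`), though that location is an assumption about how the package is laid out. A jar being present there does not by itself guarantee that the data source gets picked up.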

Any ideas on how I can get my local Databricks Connect venv to recognise the xml data source would be much appreciated!

Thanks!

sean_owen
Databricks Employee

Are you adding spark-xml as a dependency 'locally'? You're doing it right, and the name of the data source doesn't matter; both are correct. You do not need to install JARs manually.

brendan-b
New Contributor II

@Sean Owen I do not believe I have. Do you have any documentation on how to install spark-xml locally? I have tried the following with no luck. Is this what you are referring to?

PYSPARK_HOME/bin/pyspark --packages com.databricks:spark-xml_2.12:0.13.0