spark-xml not working with Databricks Connect and Pyspark
10-09-2021 05:35 PM
Hi all,
I currently have a cluster configured in Databricks with spark-xml (version com.databricks:spark-xml_2.12:0.13.0), which was installed using Maven. The spark-xml library itself works fine with PySpark when I am using it in a notebook within the Databricks web app.
I often use Databricks Connect with PySpark for development though, more specifically from VS Code. Again, Databricks Connect works fine when I am performing commands on the cluster such as spark.read.csv.
However, when I try to run my spark-xml code from within VS Code, I receive the following error:
java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html
I have tried using both read formats below with no luck. I have also tried placing the spark-xml jar file that matches the version in Databricks within my PySpark jars, but again it did not work.
df = spark.read.format('xml')
df = spark.read.format('com.databricks.spark.xml')
Any ideas on how I can get my local Databricks Connect venv to recognise the xml data source would be much appreciated!
Thanks!
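For anyone reproducing the manual-jar attempt described above, here is a minimal sketch for checking whether a jars directory actually contains the spark-xml jar for the expected Scala and library version. The function name `find_spark_xml_jars` and the idea of passing in the directory path are assumptions made for illustration; the file name follows the usual Maven `artifactId-version.jar` convention. For the legacy Databricks Connect client, the directory printed by `databricks-connect get-jar-dir` is a likely candidate to inspect, though that is also an assumption about the local setup.

```python
from pathlib import Path


def find_spark_xml_jars(jars_dir, scala_version="2.12", xml_version="0.13.0"):
    """Report which spark-xml jars sit directly inside jars_dir.

    Hypothetical helper: it only inspects file names and does not
    verify that Spark would actually load the jar.
    """
    # Maven names the artifact <artifactId>-<version>.jar
    expected = f"spark-xml_{scala_version}-{xml_version}.jar"
    found = sorted(p.name for p in Path(jars_dir).glob("spark-xml_*.jar"))
    return {"expected": expected, "found": found, "match": expected in found}


if __name__ == "__main__":
    import sys

    # Usage: python check_jars.py /path/to/jars
    if len(sys.argv) > 1:
        print(find_spark_xml_jars(sys.argv[1]))
```

A result with `match` set to `False` but a non-empty `found` list would suggest a Scala or library version mismatch rather than a missing jar.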
Labels: Cluster, Databricks connect, Pyspark
10-10-2021 09:26 AM
Are you adding spark-xml as a dependency 'locally'? You're doing it right, and the name of the data source doesn't matter: both are correct. You do not need to install JARs manually.
10-10-2021 02:55 PM
@Sean Owen I do not believe I have. Do you have any documentation on how to install spark-xml locally? I have tried the following with no luck. Is this what you are referring to?
PYSPARK_HOME/bin/pyspark --packages com.databricks:spark-xml_2.12:0.13.0
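
The `--packages` flag in the command above corresponds to Spark's `spark.jars.packages` setting, so the same dependency can also be requested when a session is built from Python. The sketch below shows that mapping. The helper names (`maven_coordinate`, `packages_conf`, `build_local_session`) are hypothetical, and whether a Databricks Connect session honours this setting is not established in this thread, so treat it as something to try rather than a confirmed fix.

```python
def maven_coordinate(group, artifact, scala_version, version):
    """Build a Maven coordinate such as com.databricks:spark-xml_2.12:0.13.0."""
    return f"{group}:{artifact}_{scala_version}:{version}"


def packages_conf(coordinates):
    """Return the Spark setting equivalent to `--packages a,b,c`."""
    return {"spark.jars.packages": ",".join(coordinates)}


def build_local_session(conf):
    """Create (or reuse) a SparkSession with the given settings applied.

    pyspark is imported lazily so the helpers above work without it.
    Assumption: settings must be applied before the session first starts.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


coord = maven_coordinate("com.databricks", "spark-xml", "2.12", "0.13.0")
conf = packages_conf([coord])
print(conf)
```

Calling `build_local_session(conf)` and then `spark.read.format('xml')` would be the next step to test; if the session was already running when the setting was applied, the package is unlikely to be picked up.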