07-26-2022 02:30 AM
When writing unit tests with unittest / pytest in PySpark, reading mock data sources with built-in formats such as CSV or JSON (spark.read.format("json")) works just fine.
But reading XML with spark.read.format("com.databricks.spark.xml") in a unit test does not work out of the box:
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml.
NOTE: the unit tests do NOT run on a Databricks cluster, but locally against a Hadoop winutils setup; a minimal sketch of the test setup follows below the labels.
Is there any way to make this work, or should I use one of Python's built-in XML libraries?
Labels: Pyspark, Unit Tests, Xml
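For context, here is a minimal sketch of the setup described above; the fixture, file paths, and the rowTag option are illustrative, not taken from the original post:

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # Plain local session; on Windows, HADOOP_HOME points at a winutils directory.
    return (
        SparkSession.builder
        .master("local[1]")
        .appName("unit-tests")
        .getOrCreate()
    )


def test_read_json(spark):
    # Built-in sources such as json and csv work out of the box.
    df = spark.read.format("json").load("tests/resources/sample.json")
    assert df.count() > 0


def test_read_xml(spark):
    # Without spark-xml on the classpath this fails with
    # java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml
    df = (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", "record")
        .load("tests/resources/sample.xml")
    )
    assert df.count() > 0
```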
Accepted Solutions
07-26-2022 05:49 AM
This is correct; the following worked for me:
SparkSession.builder.(..).config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.12.0")
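Spelled out, that can look like the sketch below; the only change compared to the plain session in the question is the extra .config line. Spark downloads the package (and its dependencies) from Maven the first time the session starts, so the test machine needs network access, and the _2.12 suffix is the Scala version, which has to match your local Spark build.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("unit-tests")
    # Pulls spark-xml from Maven when the session is created.
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.12.0")
    .getOrCreate()
)
```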
07-26-2022 05:10 AM
I suppose you run Spark locally? com.databricks.spark.xml is a library for Spark; it is not installed by default, so you have to add it to your local Spark installation.
07-26-2022 05:51 AM
Please install spark-xml from Maven. Since it comes from Maven, you need to install it on the cluster you are using, via the cluster settings (or alternatively via the API or CLI).
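For a real cluster, one scripted way to do that is the Libraries API; a rough sketch, assuming the Libraries API 2.0 install endpoint, with the workspace URL, token, and cluster ID as placeholders:

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                        # placeholder
CLUSTER_ID = "<cluster-id>"                              # placeholder

# Ask the workspace to install the Maven coordinates on the running cluster.
resp = requests.post(
    f"{HOST}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_id": CLUSTER_ID,
        "libraries": [
            {"maven": {"coordinates": "com.databricks:spark-xml_2.12:0.12.0"}}
        ],
    },
)
resp.raise_for_status()
```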
07-26-2022 06:19 AM
See above, I already found the solution. There is no cluster, only a local Spark session.

