07-26-2022 02:30 AM
When writing unit tests with unittest/pytest in PySpark, reading mock datasources with built-in formats like CSV and JSON (spark.read.format("json")) works just fine.
But reading XMLs with spark.read.format("com.databricks.spark.xml") in a unit test does not work out of the box:
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml.
NOTE: the unit tests do NOT run on a Databricks cluster, but locally against a Hadoop winutils directory.
Is there any way to make this work, or should I use one of Python's built-in XML libraries instead?
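For completeness, the fallback the question mentions (Python's built-in XML libraries) could look like this minimal sketch using the standard-library xml.etree.ElementTree; the sample data, tag names, and helper function here are purely illustrative:

```python
import xml.etree.ElementTree as ET

# Illustrative fixture data, not from the original post.
SAMPLE = """<records>
  <record><id>1</id><name>alice</name></record>
  <record><id>2</id><name>bob</name></record>
</records>"""

def records_as_dicts(xml_text):
    """Parse <record> elements into plain dicts, one per record."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in record}
        for record in root.findall("record")
    ]
```

This avoids the Spark dependency entirely in a unit test, at the cost of re-implementing whatever schema handling spark-xml would otherwise provide.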
07-26-2022 05:10 AM
I suppose you run Spark locally? com.databricks.spark.xml is a library for Spark.
It is not installed by default, so you have to add it to your local Spark installation.
07-26-2022 05:49 AM
This is correct; the following worked for me:
SparkSession.builder.(..).config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.12.0")
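A fuller sketch of that builder call for a local test session might look as follows; the master URL, app name, and helper name are assumptions, and the Maven coordinate must match the Scala version of your local Spark install:

```python
# Maven coordinate for spark-xml, as in the accepted answer.
# Assumes Spark built against Scala 2.12; adjust the suffix/version if needed.
SPARK_XML_PACKAGE = "com.databricks:spark-xml_2.12:0.12.0"

def local_spark_with_xml(app_name="xml-unit-tests"):
    """Build a local SparkSession that can read XML via spark-xml."""
    # Import inside the function so test modules can be collected
    # even on machines without PySpark/a JVM installed.
    from pyspark.sql import SparkSession
    return (
        SparkSession.builder
        .master("local[2]")
        .appName(app_name)
        .config("spark.jars.packages", SPARK_XML_PACKAGE)
        .getOrCreate()
    )

# Usage in a test (illustrative path and rowTag):
# spark = local_spark_with_xml()
# df = (spark.read.format("com.databricks.spark.xml")
#       .option("rowTag", "record")
#       .load("tests/fixtures/sample.xml"))
```

Note that spark.jars.packages downloads the jar from Maven Central on first use, so the first test run needs network access.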
07-26-2022 05:51 AM
Please install spark-xml from Maven. Since it comes from Maven, you need to install it on the cluster you are using via the cluster settings (alternatively via the API or CLI).
07-26-2022 06:19 AM
See above, I already found the solution. There is no cluster, only a local Spark session.