07-26-2022 02:30 AM
When writing unit tests with unittest/pytest in PySpark, reading mock datasources with built-in formats like CSV and JSON (spark.read.format("json")) works just fine.
But reading XMLs with spark.read.format("com.databricks.spark.xml") in a unit test does not work out of the box:
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml.
NOTE: the unit tests do NOT run on a Databricks cluster, but locally against a Hadoop winutils directory.
Is there any way to make this work, or should I use one of Python's built-in XML libraries instead?
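For completeness, the fallback the question mentions (Python's built-in XML libraries) could look like this minimal sketch using the standard-library xml.etree.ElementTree; the sample data, tag names, and helper function here are purely illustrative:

```python
import xml.etree.ElementTree as ET

# Illustrative fixture data, not from the original post.
SAMPLE = """<records>
  <record><id>1</id><name>alice</name></record>
  <record><id>2</id><name>bob</name></record>
</records>"""

def records_as_dicts(xml_text):
    """Parse <record> elements into plain dicts, one per record."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in record}
        for record in root.findall("record")
    ]
```

This avoids the Spark dependency entirely in a unit test, at the cost of re-implementing whatever schema handling spark-xml would otherwise provide.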
07-26-2022 05:10 AM
I suppose you run Spark locally? com.databricks.spark.xml is a library for Spark.
It is not installed by default, so you have to add it to your local Spark installation.
07-26-2022 05:49 AM
This is correct; the following worked for me:
SparkSession.builder.(..).config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.12.0")
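A fuller sketch of that builder call for a local test session might look as follows; the master URL, app name, and helper name are assumptions, and the Maven coordinate must match the Scala version of your local Spark install:

```python
# Maven coordinate for spark-xml, as in the accepted answer.
# Assumes Spark built against Scala 2.12; adjust the suffix/version if needed.
SPARK_XML_PACKAGE = "com.databricks:spark-xml_2.12:0.12.0"

def local_spark_with_xml(app_name="xml-unit-tests"):
    """Build a local SparkSession that can read XML via spark-xml."""
    # Import inside the function so test modules can be collected
    # even on machines without PySpark/a JVM installed.
    from pyspark.sql import SparkSession
    return (
        SparkSession.builder
        .master("local[2]")
        .appName(app_name)
        .config("spark.jars.packages", SPARK_XML_PACKAGE)
        .getOrCreate()
    )

# Usage in a test (illustrative path and rowTag):
# spark = local_spark_with_xml()
# df = (spark.read.format("com.databricks.spark.xml")
#       .option("rowTag", "record")
#       .load("tests/fixtures/sample.xml"))
```

Note that spark.jars.packages downloads the jar from Maven Central on first use, so the first test run needs network access.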
07-26-2022 05:51 AM
Please install spark-xml from Maven. Since it comes from Maven, you need to install it on the cluster you are using via the cluster settings (alternatively via the API or CLI).
07-26-2022 06:19 AM
See above, I already found the solution. There is no cluster, only a local Spark session.