Hi all,
I currently have a cluster configured in Databricks with spark-xml (version com.databricks:spark-xml_2.12:0.13.0), which was installed from Maven. The spark-xml library itself works fine with PySpark when I use it in a notebook within the Databricks web app.
However, I often use Databricks Connect with PySpark for development, specifically from VS Code. Databricks Connect works fine when I run commands against the cluster such as spark.read.csv.
However, when I try to run my spark-xml code from within VS Code, I receive the following error:
java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html
I have tried both of the read formats below with no luck. I have also tried placing a spark-xml jar matching the version on the Databricks cluster into my local PySpark jars directory, but that did not work either.
df = spark.read.format('xml')
df = spark.read.format('com.databricks.spark.xml')
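For context, the full read call I am attempting looks roughly like the sketch below (the rowTag value and file path here are just placeholders for illustration, not my actual values):

df = (spark.read.format('xml')
      .option('rowTag', 'record')        # placeholder: the XML element that wraps each row
      .load('/mnt/data/example.xml'))    # placeholder path to the XML file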
Any ideas on how I can get my local Databricks Connect venv to recognise the xml data source would be much appreciated!
Thanks!