When I run my PySpark script, I get the following error depending on which XML file I query:
cannot resolve 'explode(...)' due to data type mismatch
The PySpark code:
from pyspark.sql import SparkSession
JOB_NAME = "Complex file to delimited files transformer"
spark = SparkSession.builder.appName(JOB_NAME)\
    .config("spark.scheduler.mode", "FAIR")\
    .config('spark.jars.packages', 'com.databricks:spark-xml_2.12:0.12.0')\
    .getOrCreate()
sql_script = "select create_date, item['_id'], item['_VALUE'] from my_data lateral view explode(items.item) t as item"
# Works fine: reads every file in ./xml (test1.xml and test2.xml)
read_options = {"rowTag": "my_data"}
df = spark.read\
    .format("xml")\
    .options(**read_options)\
    .load("./xml")
df.createOrReplaceTempView("my_data")
spark.sql(sql_script).show()
# Error: reads only test2.xml, which contains a single <item>
df2 = spark.read\
    .format("xml")\
    .options(**read_options)\
    .load("./xml/test2.xml")
df2.createOrReplaceTempView("my_data")
spark.sql(sql_script).show()
Both XML files are in the xml folder.
test1.xml:
<my_data><create_date>2021-05-01</create_date><items><item id="1">item 1</item><item id="2">item 2</item></items>
</my_data>
test2.xml:
<my_data><create_date>2021-06-01</create_date><items><item id="3">item 3</item></items>
</my_data>
Expected result: the same SQL statement should work for every input. It should not fail just because one run happens to have only one <item> in <items>.
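For reference, one approach that is commonly suggested for this kind of problem is to stop relying on schema inference and pass an explicit schema, so that items.item is always typed as an array no matter how many <item> elements a file contains. The sketch below is only that, a sketch: the constant MY_DATA_SCHEMA and the helper read_my_data are hypothetical names, and the field types (STRING for create_date and _VALUE, BIGINT for _id) are assumptions based on the two sample files above, not something confirmed for the real data.

```python
# Assumed schema for the sample files; the field types are guesses based on
# test1.xml / test2.xml and would need adjusting to the real data.
# Declaring `item` as ARRAY keeps its type the same even when a file
# contains only one <item>.
MY_DATA_SCHEMA = (
    "create_date STRING, "
    "items STRUCT<item: ARRAY<STRUCT<_VALUE: STRING, _id: BIGINT>>>"
)


def read_my_data(spark, path):
    """Hypothetical helper: read XML with a fixed schema instead of inference.

    `spark` is an existing SparkSession (as created in the script above),
    `path` is a file or folder of XML documents.
    """
    return (
        spark.read.format("xml")
        .option("rowTag", "my_data")
        .schema(MY_DATA_SCHEMA)  # user-supplied schema, so nothing is inferred
        .load(path)
    )
```

With this, df2 = read_my_data(spark, "./xml/test2.xml") followed by df2.createOrReplaceTempView("my_data") should let the unchanged sql_script resolve explode(items.item), since the column is declared as an array up front. The trade-off is that the schema has to be maintained by hand whenever the XML layout changes.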