pyspark SQL cannot resolve 'explode()' due to data type mismatch

KevinXu — Wed, 11 May 2022 12:54:07 GMT

When I run a PySpark script, I get the following error depending on which XML file I query:

cannot resolve 'explode(...)' due to data type mismatch

The pyspark code:

from pyspark.sql import SparkSession
 
JOB_NAME = "Complex file to delimited files transformer"
 
spark = SparkSession.builder.appName(JOB_NAME)\
    .config("spark.scheduler.mode", "FAIR")\
    .config('spark.jars.packages', 'com.databricks:spark-xml_2.12:0.12.0')\
    .getOrCreate()
 
sql_script = "select create_date, item['_id'], item['_VALUE'] from my_data lateral view explode(items.item) t as item"
 
# Works fine: loads every file in the ./xml folder (test1.xml and test2.xml)
read_options = {"rowTag": "my_data"}
df = spark.read\
    .format("xml")\
    .options(**read_options)\
    .load("./xml")
df.createOrReplaceTempView("my_data")
spark.sql(sql_script).show()
 
# Error: loads only test2.xml, which has a single <item>
df2 = spark.read\
    .format("xml")\
    .options(**read_options)\
    .load("./xml/test2.xml")
df2.createOrReplaceTempView("my_data")
spark.sql(sql_script).show()

The XML files are in the ./xml folder.

test1.xml:

<my_data><create_date>2021-05-01</create_date><items><item id="1">item 1</item><item id="2">item 2</item></items>
</my_data>

test2.xml:

<my_data><create_date>2021-06-01</create_date><items><item id="3">item 3</item></items>
</my_data>

Expected result: the same SQL statement should work on every run, including a run where <items> happens to contain only one <item>.

Re: pyspark SQL cannot resolve 'explode()' due to data type mismatch

KevinXu — Sun, 29 May 2022 00:13:07 GMT

The error is on line 10 of the script:

sql_script = "select create_date, item['_id'], item['_VALUE'] from my_data lateral view explode(items.item) t as item"
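
One possible workaround, not confirmed in this thread: the failure looks consistent with schema inference, where a file containing a single <item> gets items.item inferred as a struct rather than an array of structs, so explode() has nothing to explode. A minimal sketch that passes a fixed schema to the reader instead of relying on inference is below. MY_DATA_SCHEMA and load_my_data are hypothetical names, the column types are assumptions based on the sample files, and it assumes spark-xml wraps a lone element in a one-element array when the supplied schema declares an array.

```python
# Sketch only: assumes the default spark-xml naming ("_" prefix for
# attributes, "_VALUE" for element text), which matches the
# item['_id'] / item['_VALUE'] lookups in the query above.
# Declaring items.item as ARRAY keeps its type the same for every file.
MY_DATA_SCHEMA = (
    "create_date STRING, "
    "items STRUCT<item: ARRAY<STRUCT<`_VALUE`: STRING, `_id`: BIGINT>>>"
)


def load_my_data(spark, path):
    """Read XML with a fixed schema instead of per-file schema inference."""
    return (
        spark.read.format("xml")
        .options(rowTag="my_data")
        .schema(MY_DATA_SCHEMA)  # DDL string; skips inference
        .load(path)
    )
```

Usage would be load_my_data(spark, "./xml/test2.xml").createOrReplaceTempView("my_data"), followed by the same spark.sql(sql_script).show() call as before.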
