running a query against multiple parquet files from a folder

Shaimaa · 06-25-2024

I am running a query against multiple parquet files:

SELECT
SUM(CASE WHEN match_result.year_incorporated IS NOT NULL AND match_result.year_incorporated != '' THEN 1 ELSE 0 END)
FROM 
parquet.`s3://folder_path/*`

For some files, the field `year_incorporated` has a string value, and for other files the entire field is null. I am getting this error for the file with all null values:

Error while reading file s3://file_path.PARQUET. Schema conversion error: cannot convert Parquet type INT32 to Photon type string(0)

How can I fix this issue?

daniel_sahal · 06-25-2024

@Shaimaa
A column type mismatch between the files is the likely cause here.
For example, if column 'xyz' is an INTEGER in one file and a STRING in another, Spark will raise a schema conversion error.
Below is a link to a good article that explains the issue in more detail. The best solution, however, is to fix the column types in the source files or to change the file format.

https://medium.com/data-arena/merging-different-schemas-in-apache-spark-2a9caca2c5ce
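
If the source files cannot be rewritten, one possible workaround is to read each file on its own, cast the field to a string, and only then combine the results. The sketch below is an illustration under assumptions, not a confirmed fix for this dataset: the function names, the `paths` list, and the `column_types` mapping are hypothetical, and the Spark part needs PySpark and an existing `spark` session.

```python
from collections import defaultdict
from functools import reduce


def group_paths_by_type(column_types):
    """Group file paths by the Parquet type recorded for one column.

    `column_types` is a hypothetical mapping of file path -> physical type
    of the column in that file (for example, taken from each file's footer).
    """
    groups = defaultdict(list)
    for path, parquet_type in sorted(column_types.items()):
        groups[parquet_type].append(path)
    return dict(groups)


def count_non_empty_years(spark, paths):
    """Count rows whose year_incorporated is neither NULL nor empty.

    Each file is read separately so that it is parsed with its own schema,
    and the field is cast to string before the frames are combined.
    `paths` must contain at least one file path.
    """
    # Imported lazily so the helper above can be used without PySpark.
    from pyspark.sql import functions as F

    frames = [
        spark.read.parquet(path).select(
            F.col("match_result.year_incorporated")
            .cast("string")
            .alias("year_incorporated")
        )
        for path in paths
    ]
    combined = reduce(lambda left, right: left.unionByName(right), frames)
    has_value = F.col("year_incorporated").isNotNull() & (
        F.col("year_incorporated") != ""
    )
    return combined.filter(has_value).count()
```

`group_paths_by_type` only helps to see which files disagree on the column type. `count_non_empty_years(spark, paths)` returns the same number as the `SUM(CASE ...)` in the question whenever the files contain at least one row. Reading many files one by one is slower than a single glob read, so fixing the types at the source remains the better long-term option.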
