running a query against multiple parquet files from a folder

Shaimaa — Tue, 25 Jun 2024 15:13:21 GMT

I am running a query against multiple parquet files:

SELECT SUM(CASE WHEN match_result.year_incorporated IS NOT NULL AND match_result.year_incorporated != '' THEN 1 ELSE 0 END) FROM parquet.`s3://folder_path/*`

For some files, the field `year_incorporated` has a string value, and for other files the entire field is null. I am getting this error for the file with all null values:

Error while reading file s3://file_path.PARQUET. Schema conversion error: cannot convert Parquet type INT32 to Photon type string(0)

How can I fix this issue?

Re: running a query against multiple parquet files from a folder

daniel_sahal — Wed, 26 Jun 2024 06:30:45 GMT

@Shaimaa
A column type mismatch between the files is likely the issue here.
For example, if column 'xyz' is an INTEGER in one file and a STRING in another, Spark will give you a schema conversion error when it reads both with a single schema.
The article linked below explains the issue in more detail; however, the best solution would be to fix the column types in the source files or to change the file format.

https://medium.com/data-arena/merging-different-schemas-in-apache-spark-2a9caca2c5ce
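
If the source files cannot be rewritten, one possible workaround is to read each file on its own, cast the field to string, and only then combine the results. The error in the question suggests that the all-null file stores `year_incorporated` as INT32 while the others store it as a string, so a per-file cast avoids forcing one schema onto every file. The following is a minimal PySpark sketch under those assumptions, not a definitive fix: the helper names (`find_type_conflicts`, `column_type`, `count_non_empty`) and the `paths` list are hypothetical, and you would need to supply the individual file paths yourself.

```python
from functools import reduce


def find_type_conflicts(file_types):
    """Group file paths by the type each one reports for a column.

    file_types maps a file path to a type name such as "string" or "int".
    Returns {} when every file agrees, otherwise {type_name: [paths]}.
    """
    by_type = {}
    for path, type_name in file_types.items():
        by_type.setdefault(type_name, []).append(path)
    return by_type if len(by_type) > 1 else {}


def column_type(spark, path, field):
    """Return the Spark type name of `field` as stored in a single file."""
    # Reading one file at a time lets Spark use that file's own schema.
    return spark.read.parquet(path).select(field).schema[0].dataType.simpleString()


def count_non_empty(spark, paths, field="match_result.year_incorporated"):
    """Count rows where `field` is neither null nor an empty string."""
    # Imported lazily so find_type_conflicts works without PySpark installed.
    from pyspark.sql import functions as F

    # Cast per file, so an INT32 column and a string column end up compatible.
    per_file = [
        spark.read.parquet(p).select(F.col(field).cast("string").alias("value"))
        for p in paths
    ]
    combined = reduce(lambda a, b: a.unionByName(b), per_file)
    return combined.filter(
        F.col("value").isNotNull() & (F.col("value") != "")
    ).count()
```

To find the offending files first, you could build a dictionary such as `{p: column_type(spark, p, "match_result.year_incorporated") for p in paths}` and pass it to `find_type_conflicts`. Reading many small files one by one is slower than a single wildcard read, so this is better suited to diagnosing the problem or to a one-off job than to a permanent pipeline.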