running a query against multiple parquet files from a folder
06-25-2024 08:13 AM
I am running a query against multiple parquet files:
SELECT
SUM(CASE WHEN match_result.year_incorporated IS NOT NULL AND match_result.year_incorporated != '' THEN 1 ELSE 0 END)
FROM
parquet.`s3://folder_path/*`
For some files, the field `year_incorporated` has a string value, and for other files the entire field is null. I am getting this error for the file with all null values:
Error while reading file s3://file_path.PARQUET. Schema conversion error: cannot convert Parquet type INT32 to Photon type string(0)
How can I fix this issue?
06-25-2024 11:30 PM - edited 06-25-2024 11:30 PM
@Shaimaa
A column type mismatch between the files is the likely cause here.
For example, if column 'xyz' is an INTEGER in one file and a STRING in another, Spark will raise a schema conversion error. That matches your error message: the file where `year_incorporated` is entirely null stores the column as INT32, while the query expects a string.
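To find out which files disagree, a small check like the one below might help. It is only a sketch: the file names and the `schemas` dictionary are made up for illustration, and the helper `find_type_conflicts` is hypothetical. In practice you could probably collect the per-file types with something like `dict(spark.read.parquet(path).dtypes)`, reading one file at a time.

```python
from collections import defaultdict


def find_type_conflicts(schemas):
    """Return {column: {type: [files]}} for columns whose type differs across files."""
    seen = defaultdict(lambda: defaultdict(list))
    for path, columns in schemas.items():
        for name, dtype in columns.items():
            seen[name][dtype].append(path)
    # Keep only the columns that appear with more than one type.
    return {
        name: {dtype: sorted(paths) for dtype, paths in by_type.items()}
        for name, by_type in seen.items()
        if len(by_type) > 1
    }


# Hypothetical per-file schemas (assumption: gathered one file at a time).
schemas = {
    "part-a.parquet": {"name": "string", "year_incorporated": "string"},
    "part-b.parquet": {"name": "string", "year_incorporated": "int"},
}

conflicts = find_type_conflicts(schemas)
print(conflicts)
```

Any column that shows up in the result with more than one type is a candidate for the conversion error.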
The article linked below explains the issue in more detail. The best solution, however, would be to fix the column types in the source files or to change the file format.
https://medium.com/data-arena/merging-different-schemas-in-apache-spark-2a9caca2c5ce
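If rewriting the source files is not an option, one possible workaround (not taken from the article, just a sketch) is to query each file on its own, cast the field to a common type, and combine the results with UNION ALL. The helper `build_union_query` and the example paths below are hypothetical; the snippet only builds the SQL text, which you would then run yourself, for example with `spark.sql(query)`.

```python
def build_union_query(paths, expr, target_type="STRING", alias="year_incorporated"):
    """Build one SELECT per file, casting `expr` to `target_type`, joined by UNION ALL."""
    selects = [
        f"SELECT CAST({expr} AS {target_type}) AS {alias} FROM parquet.`{path}`"
        for path in paths
    ]
    return "\nUNION ALL\n".join(selects)


# Hypothetical file paths; replace with the real files in your folder.
paths = [
    "s3://example-bucket/data/a.parquet",
    "s3://example-bucket/data/b.parquet",
]

inner = build_union_query(paths, "match_result.year_incorporated")

# Same aggregation as in the question, applied to the casted column.
query = (
    "SELECT SUM(CASE WHEN year_incorporated IS NOT NULL "
    "AND year_incorporated != '' THEN 1 ELSE 0 END) AS non_empty\n"
    f"FROM (\n{inner}\n) AS t"
)
print(query)
```

Because every file is read with its own schema, the INT32 column in the all-null file is cast to a string before the rows are combined. With a large number of files the generated query gets long, so this is more of a stopgap than a replacement for fixing the types at the source.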