01-17-2022 07:49 AM
One of the source systems generates, from time to time, a parquet file that is only 220 KB in size, but reading it fails with:
"java.io.IOException: Could not read or convert schema for file: 1-2022-00-51-56.parquet
Caused by: org.apache.spark.sql.AnalysisException: Parquet type not supported: INT32 (UINT_32);"
I tried to use an explicit schema together with the mergeSchema option:
df = spark.read.options(mergeSchema=True).schema(mdd_schema_struct).parquet(target)
This is able to read the file and display it, but running count or merge on it fails with:
"Caused by: java.lang.RuntimeException: Illegal row group of 0 rows"
Does anyone know what the issue could be?
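For context, a minimal sketch of what that read could look like. The actual mdd_schema_struct is not shown in the post, so the schema below is purely hypothetical; the UINT_32 column is widened to LongType, since Spark has no unsigned 32-bit type:

from pyspark.sql.types import StructType, StructField, LongType, StringType

# Hypothetical stand-in for mdd_schema_struct (the real one is not in the post).
# Parquet UINT_32 values fit safely in Spark's signed 64-bit LongType.
mdd_schema_struct = StructType([
    StructField("id", LongType(), True),         # assumed UINT_32 column
    StructField("payload", StringType(), True),  # assumed remaining column
])

df = (spark.read
      .options(mergeSchema=True)
      .schema(mdd_schema_struct)
      .parquet(target))  # target: path to the parquet files, as in the post
df.show()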
Accepted Solutions
01-17-2022 08:14 AM
It seems that the file is corrupted; maybe you can ignore such files by setting:
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
You can also try disabling Parquet filter pushdown:
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
You can also register your files as a table (pointed at that location) with the correct schema set, and then try to run:
%sql
MSCK REPAIR TABLE table_name
https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-repair-table.html
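A minimal sketch putting these suggestions together; my_table is a placeholder name, and target and mdd_schema_struct are taken from the question:

# Skip files Spark cannot read instead of failing the whole job.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
# Avoid pushing filters down into the suspect row groups.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")

df = spark.read.schema(mdd_schema_struct).parquet(target)
print(df.count())  # the action that previously failed

# Register an external table over the same location ...
spark.sql(f"CREATE TABLE IF NOT EXISTS my_table USING PARQUET LOCATION '{target}'")
# ... and, if the table is partitioned, recover its partition metadata:
spark.sql("MSCK REPAIR TABLE my_table")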
02-08-2022 06:52 AM
Yes, I had to use the badRecordsPath option, which writes the bad files to a given path.
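In Databricks this is the badRecordsPath option; a minimal sketch, with /tmp/bad_records as a placeholder path:

# Databricks-specific: files or records that cannot be parsed are written
# under badRecordsPath (as JSON exception records) instead of failing the read.
df = (spark.read
      .option("badRecordsPath", "/tmp/bad_records")  # placeholder location
      .schema(mdd_schema_struct)
      .parquet(target))
df.count()  # corrupt inputs now land under /tmp/bad_records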
02-09-2022 08:13 AM
@nafri A - Howdy! My name is Piper, and I'm a community moderator for Databricks. Would you be happy to mark @Hubert Dudek's answer as best if it solved the problem? That will help other members find the answer more quickly. Thanks 🙂