Data Engineering

Issue reading a Parquet file in PySpark on Databricks

irfanaziz
Contributor II

One of the source systems occasionally generates a Parquet file that is only 220 KB in size.

But reading it fails:

"java.io.IOException: Could not read or convert schema for file: 1-2022-00-51-56.parquet

Caused by: org.apache.spark.sql.AnalysisException: Parquet type not supported: INT32 (UINT_32);"

I tried supplying an explicit schema together with the mergeSchema option:

df = spark.read.options(mergeSchema=True).schema(mdd_schema_struct).parquet(target)

This is able to read the file and display it, but running count() or a merge on it fails with:

"Caused by: java.lang.RuntimeException: Illegal row group of 0 rows"

Does anyone know what the issue could be?

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

It seems the file is corrupted. Maybe you can ignore such files by setting:

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

You can also check this setting:

sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

You can also register your files as a table (pointing at that location, with the correct schema set) and then try to run:

%sql

MSCK REPAIR TABLE table_name

https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-repair-table.html


3 REPLIES


irfanaziz
Contributor II

Yes, I had to use the badRecordsPath option, which puts the bad files in a given path.

Anonymous
Not applicable

@nafri A - Howdy! My name is Piper, and I'm a community moderator for Databricks. Would you be happy to mark @Hubert Dudek's answer as best if it solved the problem? That will help other members find the answer more quickly. Thanks 🙂
