
Issue in reading parquet file in pyspark databricks.

irfanaziz — Mon, 17 Jan 2022 15:49:47 GMT

One of the source systems occasionally generates a Parquet file that is only 220 KB in size.

Reading it fails with:

java.io.IOException: Could not read or convert schema for file: 1-2022-00-51-56.parquet

Caused by: org.apache.spark.sql.AnalysisException: Parquet type not supported: INT32 (UINT_32);

I tried supplying an explicit schema together with the mergeSchema option:

df = spark.read.options(mergeSchema=True).schema(mdd_schema_struct).parquet(target)

This reads the file and can display it, but running count or a merge on it fails with:

Caused by: java.lang.RuntimeException: Illegal row group of 0 rows

Does anyone know what the issue could be?
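Both errors point at the file's footer metadata: an unsigned integer column (UINT_32) and a row group that reports 0 rows. A rough way to confirm this outside Spark is to inspect the footer with pyarrow. The helper name `summarize_parquet_footer` is hypothetical, and the sketch assumes pyarrow is installed and the file is reachable through a local path. It only looks at top-level columns.

```python
def summarize_parquet_footer(parquet_file):
    """Inspect a pyarrow.parquet.ParquetFile-like object.

    Returns (unsigned_columns, empty_row_groups):
      unsigned_columns  - names of top-level columns whose Arrow type is uint8/16/32/64
      empty_row_groups  - indexes of row groups whose footer reports 0 rows
    """
    # Arrow type names for unsigned integers start with "uint" (e.g. "uint32").
    unsigned_columns = [
        field.name
        for field in parquet_file.schema_arrow
        if str(field.type).startswith("uint")
    ]

    metadata = parquet_file.metadata
    empty_row_groups = [
        index
        for index in range(metadata.num_row_groups)
        if metadata.row_group(index).num_rows == 0
    ]
    return unsigned_columns, empty_row_groups


# Usage (local_path is a placeholder for wherever the file is mounted):
#   import pyarrow.parquet as pq
#   print(summarize_parquet_footer(pq.ParquetFile(local_path)))
```

If the first list is non-empty, that would explain the UINT_32 schema error. If the second is non-empty, that would match the "Illegal row group of 0 rows" failure.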

Re: Issue in reading parquet file in pyspark databricks.

Hubert-Dudek — Mon, 17 Jan 2022 16:14:43 GMT

It seems the file is corrupted. You may be able to skip such files by setting:

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

You can also try this setting:

sqlContext.setConf("spark.sql.parquet.filterPushdown","false")
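A minimal sketch of how the two settings above might be applied before a read. The function name `read_parquet_tolerantly` is hypothetical, `spark` is assumed to be the notebook's SparkSession, and `spark.conf.set` is used for both settings in place of `sqlContext.setConf`. Keep in mind that ignoreCorruptFiles skips unreadable files silently, so rows can go missing without any error.

```python
def read_parquet_tolerantly(spark, path, schema):
    """Read Parquet from `path` with an explicit schema, skipping corrupt files.

    `spark` is an existing SparkSession; `schema` is a StructType (or DDL string).
    """
    # Skip files that cannot be read instead of failing the whole job.
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
    # Disable Parquet filter pushdown, as suggested above.
    spark.conf.set("spark.sql.parquet.filterPushdown", "false")
    return spark.read.schema(schema).parquet(path)


# Usage, with the names from the original question:
#   df = read_parquet_tolerantly(spark, target, mdd_schema_struct)
#   df.count()
```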

You can also register the files as a table (pointing to the location that holds the files) with the correct schema set, and then try running:

%sql

MSCK REPAIR TABLE table_name

https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-repair-table.html

Re: Issue in reading parquet file in pyspark databricks.

irfanaziz — Tue, 08 Feb 2022 14:52:28 GMT

Yes, I had to use the badRows option, which puts the bad files in a given path.
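The "badRows" option is probably Databricks' `badRecordsPath` reader option. That identification is an assumption, since the post does not give the exact name. A sketch of such a read follows, where the function name and the `bad_records_path` argument are hypothetical and `spark` is an existing SparkSession on Databricks:

```python
def read_with_bad_records_path(spark, path, schema, bad_records_path):
    """Read Parquet from `path`, recording unreadable files under `bad_records_path`.

    Assumes the Databricks-specific "badRecordsPath" reader option is what the
    post means by "badRows".
    """
    return (
        spark.read
        .schema(schema)
        .option("badRecordsPath", bad_records_path)
        .parquet(path)
    )


# Usage, with the names from the original question and a placeholder output path:
#   df = read_with_bad_records_path(spark, target, mdd_schema_struct, "/tmp/badRecordsPath")
```

Unlike ignoreCorruptFiles, this approach leaves a record of which files were skipped, so they can be reported back to the source system.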

Re: Issue in reading parquet file in pyspark databricks.

Anonymous — Wed, 09 Feb 2022 16:13:04 GMT

@nafri A - Howdy! My name is Piper, and I'm a community moderator for Databricks. Would you be happy to mark @Hubert Dudek's answer as best if it solved the problem? That will help other members find the answer more quickly. Thanks 🙂