Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Issue reading a parquet file in PySpark on Databricks

irfanaziz
Contributor II

One of the source systems occasionally generates a parquet file that is only 220 KB in size.

But reading it fails.

"java.io.IOException: Could not read or convert schema for file: 1-2022-00-51-56.parquet

Caused by: org.apache.spark.sql.AnalysisException: Parquet type not supported: INT32 (UINT_32);"

I tried supplying an explicit schema together with the mergeSchema option:

df = spark.read.options(mergeSchema=True).schema(mdd_schema_struct).parquet(target)

This reads the file and displays it, but running a count or a merge on it fails with:

"Caused by: java.lang.RuntimeException: Illegal row group of 0 rows"

Does anyone know what the issue could be?

1 ACCEPTED SOLUTION

Accepted Solutions

Hubert-Dudek
Esteemed Contributor III

It seems the file is corrupted. Maybe you can ignore such files by setting:

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

You can also try disabling Parquet filter pushdown:

sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

You can also register your files as a table (pointed at that location) with the correct schema set, and then try to run:

%sql

MSCK REPAIR TABLE table_name

https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-repair-table.html
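Put together, the suggested workaround might look like the following (a sketch assuming an active Spark session on a recent runtime, where `spark.conf.set` replaces the older `sqlContext.setConf`; `table_name` is a placeholder for your own table):

```python
# Silently skip files Spark cannot read at all. Note this drops the
# corrupt files from the result rather than fixing them.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# Disable Parquet filter pushdown, which can fail on files containing
# an empty (0-row) row group.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")

# Recover partition metadata for a table registered over the file
# location (placeholder table name).
spark.sql("MSCK REPAIR TABLE table_name")
```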


3 REPLIES


irfanaziz
Contributor II

Yes, I had to use the badRecordsPath option, which writes the bad files to a given path.
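The option referred to here is presumably Databricks' `badRecordsPath`, which diverts unreadable files and unparseable records to a path as JSON instead of failing the job. A minimal sketch, assuming Databricks Runtime and using the schema and target variables from the question plus a hypothetical output path:

```python
# badRecordsPath is Databricks-specific: files Spark cannot read are
# recorded under the given path and the rest of the job proceeds.
df = (
    spark.read
    .option("badRecordsPath", "/mnt/bad-records/")  # hypothetical path
    .schema(mdd_schema_struct)                      # schema from the question
    .parquet(target)
)
```

Afterwards, the contents of the bad-records path can be reviewed to decide whether the skipped files need to be regenerated by the source system.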

Anonymous
Not applicable

@nafri A​ - Howdy! My name is Piper, and I'm a community moderator for Databricks. Would you be happy to mark @Hubert Dudek​'s answer as best if it solved the problem? That will help other members find the answer more quickly. Thanks 🙂
