Data Engineering

Issue reading a Parquet file in PySpark on Databricks

irfanaziz
Contributor II

One of the source systems occasionally generates a Parquet file that is only 220 KB in size.

But reading it fails:

"java.io.IOException: Could not read or convert schema for file: 1-2022-00-51-56.parquet

Caused by: org.apache.spark.sql.AnalysisException: Parquet type not supported: INT32 (UINT_32);"

I tried supplying an explicit schema together with the mergeSchema option:

df = spark.read.options(mergeSchema=True).schema(mdd_schema_struct).parquet(target)

This is able to read the file and display it, but running count() or a merge on it fails with:

"Caused by: java.lang.RuntimeException: Illegal row group of 0 rows"

Does anyone know what the issue could be?

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

It seems the file is corrupted. Maybe you can ignore such files by setting:

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

You can also check this setting:

sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

You can also register your files as a table (pointing at that location, with the correct schema set) and then try to run:

%sql

MSCK REPAIR TABLE table_name

https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-repair-table.html


3 REPLIES


irfanaziz
Contributor II

Yes, I had to use the badRecordsPath option, which puts the bad files in a given path.

Anonymous
Not applicable

@nafri A - Howdy! My name is Piper, and I'm a community moderator for Databricks. Would you be happy to mark @Hubert Dudek's answer as best if it solved the problem? That will help other members find the answer more quickly. Thanks 🙂
