Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Issue reading a Parquet file in PySpark on Databricks

irfanaziz
Contributor II

One of the source systems occasionally generates a Parquet file that is only 220 KB in size, but reading it fails with:

java.io.IOException: Could not read or convert schema for file: 1-2022-00-51-56.parquet

Caused by: org.apache.spark.sql.AnalysisException: Parquet type not supported: INT32 (UINT_32);

I tried supplying an explicit schema together with the mergeSchema option:

df = spark.read.options(mergeSchema=True).schema(mdd_schema_struct).parquet(target)

This is able to read and display the file, but running a count or a merge on it fails with:

Caused by: java.lang.RuntimeException: Illegal row group of 0 rows

Does anyone know what the issue could be?

1 ACCEPTED SOLUTION


Hubert-Dudek
Esteemed Contributor III

It seems the file is corrupted; maybe you can ignore such files by setting:

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

You can also try turning off Parquet filter pushdown:

sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

You can also register your files as a table (pointing to that location) with the correct schema set, and then try to run:

%sql

MSCK REPAIR TABLE table_name

https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-repair-table.html
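Taken together, the suggestions above can be sketched as follows. This is a hypothetical sketch only: it assumes a Databricks notebook with an active `spark` session, and the table name, schema, and location are placeholders, not values from this thread.

```python
# Sketch of the workaround steps above (assumes a Databricks/PySpark session).
# Table name, columns, and path below are placeholders for illustration.

# 1. Skip files that Spark cannot read at all.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# 2. Disable Parquet filter pushdown, which can stumble over empty row groups.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")

# 3. Register the files as an external table with an explicit schema,
#    then repair the table so its partitions are (re)discovered.
spark.sql("""
    CREATE TABLE IF NOT EXISTS source_files (id INT, value STRING)
    USING PARQUET
    LOCATION '/mnt/landing/source_files/'
""")
spark.sql("MSCK REPAIR TABLE source_files")
```

Note that MSCK REPAIR TABLE only has an effect on partitioned external tables; for an unpartitioned location, the CREATE TABLE step alone registers the files.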


3 REPLIES

irfanaziz
Contributor II

Yes, I had to use the badRecordsPath option, which writes the bad files to a given path.
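For reference, `badRecordsPath` is a Databricks-specific reader option that redirects unreadable records and files to the given location instead of failing the job. A minimal sketch of how it is passed, reusing the names from the question (the bad-records path itself is a placeholder):

```python
# Hypothetical sketch (Databricks only): badRecordsPath diverts files/records
# that cannot be parsed into the given directory instead of raising an error.
df = (
    spark.read
    .option("badRecordsPath", "/mnt/bad-records/")  # placeholder path
    .schema(mdd_schema_struct)
    .parquet(target)
)
```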

Anonymous
Not applicable

@nafri A - Howdy! My name is Piper, and I'm a community moderator for Databricks. Would you be happy to mark @Hubert Dudek's answer as best if it solved the problem? That will help other members find the answer more quickly. Thanks 🙂
