Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Spark reads a GZ file as corrupted data when the file extension is upper-case .GZ

hprasad
New Contributor III

If the file is named file_name.sv.gz (lower-case extension), reading works fine. If it is named file_name.sv.GZ (upper-case extension), the data comes back corrupted, meaning Spark simply reads the compressed bytes as-is.
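Roughly what I am running (a simplified sketch; the path and delimiter below are placeholders, not my actual job):

// Lower-case extension: Spark picks the gzip codec and parses the rows correctly.
val ok = spark.read.option("delimiter", "|").csv("dbfs:/tmp/file_name.sv.gz")

// Upper-case extension: no codec is matched, so the compressed bytes land in the rows as garbage.
val bad = spark.read.option("delimiter", "|").csv("dbfs:/tmp/file_name.sv.GZ")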

[Attachment: hprasad_0-1705667590987.png]

Regards,
Hari Prasad
8 REPLIES

Lakshay
Databricks Employee

I don't think .GZ (upper case) is a valid file extension. Most systems I have seen compress files with the .gz (lower case) extension.

hprasad
New Contributor III

I'd argue that if a .gz file is purposely renamed to .GZ, it should still be treated as a valid gzip file, because the .GZ file still contains valid compressed data.

Regards,
Hari Prasad

Lakshay
Databricks Employee

Agreed, but Spark infers the compression from the filename, and it cannot infer gzip compression from the .GZ extension. You can read more about this in the article below:

https://aws.plainenglish.io/demystifying-apache-spark-quirks-2c91ba2d3978
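Concretely, the suffix lookup happens in Hadoop's CompressionCodecFactory and is case-sensitive. An illustrative check you can run in a Scala notebook cell (not an excerpt from Spark itself):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.CompressionCodecFactory

val factory = new CompressionCodecFactory(new Configuration())
factory.getCodec(new Path("file_name.sv.gz"))  // resolves to GzipCodec
factory.getCodec(new Path("file_name.sv.GZ"))  // null, so Spark reads the raw bytes as-is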

hprasad
New Contributor III

Yes, Spark does infer it from the filename; I have been through the Spark code on GitHub.

The article also refers to that internal code in the Spark library.

I assume we could add an exception to handle .GZ files as gzip by tweaking the Spark libraries.

Regards,
Hari Prasad

Lakshay
Databricks Employee

Yes, it can be done, but is it worth doing? This is something you could discuss in a Jira ticket.

hprasad
New Contributor III

I think it is worth handling, as a filename or extension should not be a constraint on processing data.

Since we already know it is a gzip file, we should be able to tell the reader to treat it as gzip.
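For example, something along these lines should work as a workaround without renaming the files (an untested sketch in Scala; the path and delimiter are placeholders, and it bypasses the DataFrame reader's extension-based inference entirely):

import java.util.zip.GZIPInputStream
import scala.io.Source
import spark.implicits._

// Read the .GZ files as binary, gunzip them manually, then hand the lines to the CSV parser.
val lines = sc.binaryFiles("dbfs:/tmp/*.sv.GZ").flatMap { case (_, stream) =>
  Source.fromInputStream(new GZIPInputStream(stream.open()), "UTF-8").getLines()
}
val df = spark.read.option("delimiter", "|").csv(spark.createDataset(lines))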

 

Thanks a lot for your responses, @Lakshay.

Regards,
Hari Prasad

Lakshay
Databricks Employee

Happy to help!

hprasad
New Contributor III

Recently I started looking at a solution for this issue again, and I found that we can add an exception to allow ".GZ" in the Hadoop library, since that is where the GzipCodec is resolved from.
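The rough idea (an untested sketch; UpperCaseGzipCodec is my own name for it, and the class has to be compiled into a JAR on the cluster classpath):

import org.apache.hadoop.io.compress.GzipCodec

// Behaves exactly like GzipCodec, but CompressionCodecFactory will match it
// against the upper-case ".GZ" suffix.
class UpperCaseGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".GZ"
}

It would then be listed through the Hadoop property io.compression.codecs (spark.hadoop.io.compression.codecs in the cluster Spark config), alongside the default codecs to be safe. I still need to verify this end to end.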

Regards,
Hari Prasad
