How to read a compressed file in spark if the filename does not include the file extension for that compression format?

Anonymous
Not applicable

For example, let's say I have a file called some-file, which is a gzipped text file. If I try spark.read.text('some-file'), it returns gibberish, since Spark doesn't know the file is gzipped. I'm looking for a way to tell Spark explicitly that the file is gzipped so it decodes it accordingly. I did some searching, but the answers I found either don't address the question or say it can't be done.

2 REPLIES

sean_owen
Honored Contributor II

Other than renaming the file, I'm not sure you can do much. Figuring out how to read a compressed file happens a level below Spark, in the Hadoop APIs, and from the source it definitely keys off the file name.

If they aren't big files, you can load the raw bytes with spark.read.format("binaryFile").load(path), then apply a UDF that gunzips the content with a library and interprets the bytes as a string. In Scala you can then treat the result as a Dataset[String] and pass it to things like spark.read.csv; I'm not sure you can do the same in Python. But that at least gets you the whole text of each file.

