How to read a compressed file in Spark if the filename does not include the file extension for that compression format?
06-22-2021 06:13 PM
For example, let's say I have a file called `some-file`, which is a gzipped text file. If I try `spark.read.text('some-file')`, it returns gibberish, since Spark doesn't know the file is gzipped. I'm looking for a way to tell Spark explicitly that the file is gzipped so it decodes it accordingly. I did some searching, but the answers I found either don't address the question or say it can't be done.
Labels: File
06-22-2021 06:32 PM
Other than renaming the file, I'm not sure there's much you can do. Working out how to read a compressed file happens a level below Spark, in the Hadoop input APIs, and from the source it clearly keys off the file extension. A sketch of the rename approach follows.
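Here's a minimal sketch of that rename approach in PySpark, assuming the file lives on a Hadoop-compatible filesystem and that `some-file` is a placeholder path. It reaches the Hadoop `FileSystem` API through PySpark's internal `_jvm` gateway, which is unofficial but commonly used:

```python
# Rename the file so it carries a .gz suffix; the Hadoop input layer
# then selects the gzip codec automatically when Spark reads it.
# NOTE: _jvm / _jsc are internal PySpark handles into the JVM.
hadoop = spark.sparkContext._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark.sparkContext._jsc.hadoopConfiguration())
fs.rename(hadoop.fs.Path("some-file"), hadoop.fs.Path("some-file.gz"))

# With the extension in place, the file is decompressed transparently.
df = spark.read.text("some-file.gz")
```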
If the files aren't big, you can load their raw bytes with the `binaryFile` data source (`spark.read.format("binaryFile").load(...)`) and then apply a UDF that gunzips each file with a library and decodes the bytes as a string. In Scala you can then turn the result into a `Dataset[String]` and pass it to things like `spark.read.csv`; I'm not sure you can do the same in Python. But that at least gets you the whole text of each file.
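A minimal sketch of that binaryFile approach, assuming Spark 3.0+ (which ships the `binaryFile` data source), UTF-8 text inside the gzip archives, and `some-file` as a placeholder path:

```python
import gzip

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Read the raw bytes; the binaryFile source exposes a "content" column
# of BinaryType holding each file's full contents.
binary_df = spark.read.format("binaryFile").load("some-file")

@F.udf(returnType=StringType())
def gunzip_text(data):
    # Decompress the raw gzip bytes and decode them as UTF-8 text.
    return gzip.decompress(bytes(data)).decode("utf-8")

text_df = binary_df.select(gunzip_text(F.col("content")).alias("text"))
text_df.show(truncate=False)
```

Note that each row ends up holding the entire decompressed file, so this only works while the files fit comfortably in executor memory, which matches the "if they aren't big files" caveat above.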