How to read a compressed file in spark if the filename does not include the file extension for that compression format?

Anonymous
Not applicable

For example, let's say I have a file called some-file, which is a gzipped text file. If I try spark.read.text('some-file'), it returns gibberish, since Spark doesn't know the file is gzipped. I'm looking for a way to tell Spark explicitly that the file is gzipped so it decodes it accordingly. I did some searching, but the answers I found either don't address the question or say it can't be done.

2 REPLIES

sean_owen
Honored Contributor II

Other than renaming the file, I'm not sure you can do much. Figuring out how to read a compressed file happens a level below Spark, in the Hadoop APIs, and from the source it definitely keys off the file name.

If they aren't big files, you can load the raw bytes with spark.read.format("binaryFile").load(path), then apply a UDF that gunzips the content with a library and interprets the bytes as a string. In Scala you can then treat the result as a Dataset[String] and pass it to things like spark.read.csv; I'm not sure you can do the same in Python. But that at least gets you the whole text of each file.

