How can I change the parquet compression algorithm from gzip to something else?

User16301467532 — Wed, 15 Jul 2015 18:45:24 GMT

By default, Spark uses gzip to store Parquet files. I would like to change the compression algorithm from gzip to snappy or lz4.

Re: How can I change the parquet compression algorithm from gzip to something else?

User16301467532 — Wed, 15 Jul 2015 18:46:35 GMT

You can set the Spark SQL property spark.sql.parquet.compression.codec.

In SQL:

%sql set spark.sql.parquet.compression.codec=snappy

You can also set it on the sqlContext directly:

sqlContext.setConf("spark.sql.parquet.compression.codec.", "snappy")

Re: How can I change the parquet compression algorithm from gzip to something else?

JohnCavanaugh — Sat, 07 May 2016 06:06:30 GMT

Note that the setConf call above has a slight typo: the property name ends with a stray period.

The corrected call is: sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

Unfortunately, it appears that lz4 isn't supported as a Parquet compression codec. I'm not sure why, as lz4 is supported for spark.io.compression.codec.
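
For reference, here is a minimal PySpark sketch of the session-level setting. It assumes a Spark 2.x or later SparkSession-like object (the posts above use the older sqlContext), and the helper name set_parquet_codec is hypothetical, not part of Spark.

```python
# Property name from the posts above; note there is no trailing period.
PARQUET_CODEC_KEY = "spark.sql.parquet.compression.codec"


def set_parquet_codec(spark, codec):
    """Set the session-wide Parquet compression codec.

    `spark` is assumed to be a SparkSession; any object exposing
    conf.set(key, value) behaves the same way here.
    """
    spark.conf.set(PARQUET_CODEC_KEY, codec)
    return PARQUET_CODEC_KEY, codec
```

With a live session this would be called as set_parquet_codec(spark, "snappy") before writing any Parquet output.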

Re: How can I change the parquet compression algorithm from gzip to something else?

karthik_thati — Thu, 28 Jul 2016 21:01:24 GMT

What are the options if I don't need any compression when writing my DataFrame to HDFS in Parquet format?

Re: How can I change the parquet compression algorithm from gzip to something else?

girivaratharaja — Thu, 28 Jul 2016 22:34:39 GMT

@karthik.thati - Try this

df.write.option("compression","none").mode("overwrite").save("testoutput.parquet")
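
The same idea as a small PySpark sketch, for anyone who wants the codec chosen per write rather than for the whole session. The function name write_parquet and its defaults are hypothetical; `df` is assumed to be a Spark DataFrame.

```python
def write_parquet(df, path, codec="none"):
    """Write `df` as Parquet, choosing the compression for this write only."""
    # The "compression" write option applies to this single write and
    # takes precedence over spark.sql.parquet.compression.codec.
    (df.write
       .option("compression", codec)
       .mode("overwrite")
       .parquet(path))
```

For example, write_parquet(df, "testoutput.parquet") writes uncompressed files, while write_parquet(df, "testoutput.parquet", "snappy") uses snappy without touching the session configuration.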

Re: How can I change the parquet compression algorithm from gzip to something else?

sujoyDutta — Fri, 09 Jun 2017 16:26:44 GMT

sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")

Re: How can I change the parquet compression algorithm from gzip to something else?

sujoyDutta — Fri, 09 Jun 2017 16:44:23 GMT

For uncompressed output, use:

sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")

The second argument (the codec value) can be one of four: uncompressed, snappy, gzip, lzo
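
A rough Python sketch of checking a codec name against that list before building the SET statement. The list is copied from this post and other Spark versions may accept different values; the names KNOWN_CODECS and codec_set_statement are hypothetical.

```python
# Codec names as listed in the post above; treat this as an assumption,
# since the accepted values depend on the Spark version.
KNOWN_CODECS = ("uncompressed", "snappy", "gzip", "lzo")


def codec_set_statement(codec):
    """Return a SQL SET statement for `codec`, or raise ValueError if unlisted."""
    name = codec.strip().lower()
    if name not in KNOWN_CODECS:
        raise ValueError(
            "unsupported codec %r; expected one of: %s"
            % (codec, ", ".join(KNOWN_CODECS))
        )
    return "SET spark.sql.parquet.compression.codec=" + name
```

The returned string could then be passed to spark.sql(...) or run in a %sql cell.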

Re: How can I change the parquet compression algorithm from gzip to something else?

venkat_anampudi — Sun, 31 Dec 2017 14:31:29 GMT

@prakash573: I believe Spark uses Snappy compression for Parquet files by default. I'm referring to the book "Learning Spark", Chapter 9, page 182, Table 9-3.

Please correct me if this is wrong.

Thank You

Venkat Anampudi

Re: How can I change the parquet compression algorithm from gzip to something else?

ZhenZeng — Tue, 01 Oct 2019 09:10:05 GMT

spark.sql("set spark.sql.parquet.compression.codec=gzip");

Re: How can I change the parquet compression algorithm from gzip to something else?

Pooja1 — Thu, 16 Jan 2020 10:47:34 GMT

Starting with Spark version 2.1.0, "snappy" is the default compression codec; before that version, "gzip" was the default.
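
Since the default depends on the Spark version, one option is to ask the running session which codec is in effect. A minimal sketch, assuming a SparkSession-like object; the helper name current_parquet_codec is hypothetical.

```python
def current_parquet_codec(spark):
    """Return the Parquet codec currently configured for the session.

    `spark` is assumed to be a SparkSession (Spark 2.x or later); when the
    property has not been set explicitly, this should report the version's
    default.
    """
    return spark.conf.get("spark.sql.parquet.compression.codec")
```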