How can I change the parquet compression algorithm from gzip to something else?

User16301467532
New Contributor II

Spark, by default, uses gzip to store parquet files. I would like to change the compression algorithm from gzip to snappy or lz4.

9 REPLIES

User16301467532
New Contributor II

You can set the following Spark SQL property: spark.sql.parquet.compression.codec.

In sql:

%sql set spark.sql.parquet.compression.codec=snappy

You can also set it on the sqlContext directly:

sqlContext.setConf("spark.sql.parquet.compression.codec.", "snappy")
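On Spark 2.x and later (including Databricks runtimes), the same setting can be applied through the SparkSession instead of the sqlContext. A minimal PySpark sketch, assuming a local session and a hypothetical output path:

# Minimal PySpark sketch: set the session-wide Parquet codec, then write.
# The output path /tmp/events_parquet is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-codec-demo").getOrCreate()
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").parquet("/tmp/events_parquet")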

JohnCavanaugh
New Contributor II

Note that the above has a slight typo (a trailing period in the property name).

You can also set it on the sqlContext directly: sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

Unfortunately, it appears that lz4 isn't supported as a Parquet compression codec. I'm not sure why, since lz4 is supported for spark.io.compression.codec.
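For readers on newer Spark releases: the documentation there lists lz4 (and zstd) among the accepted values for spark.sql.parquet.compression.codec, so the limitation above applies to the older versions discussed in this thread. A sketch under that assumption:

# Sketch assuming a Spark version whose Parquet writer accepts "lz4"
# (older versions reject it with an IllegalArgumentException).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.parquet.compression.codec", "lz4")
spark.range(10).write.mode("overwrite").parquet("/tmp/lz4_demo_parquet")  # hypothetical path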

karthik_thati
New Contributor II

What are the options if I don't need any compression while writing my DataFrame to HDFS in Parquet format?

sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")

girivaratharaja
New Contributor III

@karthik.thati - Try this:

df.write.option("compression","none").mode("overwrite").save("testoutput.parquet")
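For what it's worth, the per-write option takes precedence over the session-level codec for that single write, which is handy when only one job needs different output. A small PySpark sketch with hypothetical paths:

# Sketch: the write-level "compression" option overrides the session default
# (set to snappy here) for this one write only. The path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

spark.range(100).write.option("compression", "none").mode("overwrite").parquet("/tmp/uncompressed_parquet")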

sujoyDutta
New Contributor II

For uncompressed output, use:

sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")

The value can be one of the four: uncompressed, snappy, gzip, lzo.
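A quick way to confirm which codec was actually used is to look at the part-file names: Spark appends the codec to each file (for example .snappy.parquet or .gz.parquet), while uncompressed output gets a plain .parquet suffix. A sketch, assuming local mode and a hypothetical output path:

# Sketch: write with gzip, then list the part files to check the codec suffix.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

out = "/tmp/codec_check_parquet"  # hypothetical local path
spark.range(10).write.mode("overwrite").parquet(out)

# Expect names like part-00000-<uuid>.gz.parquet for gzip,
# .snappy.parquet for snappy, and plain .parquet for uncompressed.
print([f for f in os.listdir(out) if f.startswith("part-")])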

venkat_anampudi
New Contributor II

@prakash573: I guess Spark uses "snappy" compression for Parquet files by default. I'm referring to the book "Learning Spark", Chapter 9, page 182, Table 9-3.

Please confirm if this is not correct.

Thank You

Venkat Anampudi

Starting from Spark version 2.1.0, "snappy" is the default Parquet compression codec; before that version, "gzip" was the default.
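To check which default a given cluster actually uses, the effective value can be read back from the session. A short PySpark sketch:

# Sketch: read the effective Parquet codec for the current session.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.conf.get("spark.sql.parquet.compression.codec"))  # e.g. "snappy" on Spark 2.1.0+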

ZhenZeng
New Contributor II

spark.sql("set spark.sql.parquet.compression.codec=gzip");
