Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How can I change the parquet compression algorithm from gzip to something else?

User16301467532
New Contributor II

By default, Spark uses gzip to compress Parquet files. I would like to change the compression codec from gzip to snappy or lz4.

9 REPLIES

User16301467532
New Contributor II

You can set the Spark SQL property spark.sql.parquet.compression.codec.

In SQL:

%sql set spark.sql.parquet.compression.codec=snappy

You can also set it on the sqlContext directly:

sqlContext.setConf("spark.sql.parquet.compression.codec.", "snappy")

JohnCavanaugh
New Contributor II

Note that the above has a slight typo (a trailing period in the property name).

You can also set it on the sqlContext directly: sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

Unfortunately, it appears that lz4 isn't supported as a Parquet compression codec. I'm not sure why, since lz4 is supported for io.codec.

karthik_thati
New Contributor II

What are the options if I don't need any compression when writing my DataFrame to HDFS in Parquet format?

sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")

girivaratharaja
New Contributor III

@karthik.thati - Try this:

df.write.option("compression","none").mode("overwrite").save("testoutput.parquet")
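A minimal sketch of the same per-write approach, wrapped in a helper. It assumes df is a PySpark DataFrame; the function name write_parquet and the default path handling are illustrative, not part of any Spark API. The "compression" write option applies only to this write and takes precedence over spark.sql.parquet.compression.codec.

```python
def write_parquet(df, path, codec="none"):
    """Write `df` to `path` as Parquet using a per-write compression option.

    `df` is assumed to be a PySpark DataFrame. `codec` is passed through
    unchanged to the "compression" option ("none" means no compression).
    """
    (
        df.write
        .option("compression", codec)  # per-write setting, as in the reply above
        .mode("overwrite")             # replace any existing output at `path`
        .parquet(path)                 # write explicitly in Parquet format
    )
```

For example, write_parquet(df, "testoutput.parquet") writes uncompressed output, and write_parquet(df, "testoutput.parquet", "snappy") writes snappy-compressed output.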

sujoyDutta
New Contributor II

For uncompressed output, use:

sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")

The second argument can be one of four values: uncompressed, snappy, gzip, or lzo.
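As a rough sketch, the four values listed above can be checked before the property is set, so a typo in the codec name fails early. This assumes spark is a SparkSession (any object exposing conf.set works); the names THREAD_CODECS and set_parquet_codec are hypothetical helpers, and newer Spark versions may accept codecs beyond these four.

```python
# Property name taken from the thread.
CODEC_KEY = "spark.sql.parquet.compression.codec"

# The four values listed in the thread; treat this list as an assumption,
# since the set of accepted codecs depends on the Spark version.
THREAD_CODECS = ("uncompressed", "snappy", "gzip", "lzo")


def set_parquet_codec(spark, codec):
    """Validate `codec` against THREAD_CODECS, then set it on the session.

    Returns the normalized (trimmed, lower-cased) codec name that was applied.
    """
    name = codec.strip().lower()
    if name not in THREAD_CODECS:
        raise ValueError(
            f"unsupported codec {codec!r}; expected one of {THREAD_CODECS}"
        )
    spark.conf.set(CODEC_KEY, name)
    return name
```

Calling set_parquet_codec(spark, "snappy") has the same effect as the setConf call above, while set_parquet_codec(spark, "lz4") raises ValueError under this sketch's list.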

venkat_anampudi
New Contributor II

@prakash573: I believe Spark uses snappy compression for Parquet files by default. I'm referring to "Learning Spark", Chapter 9, page 182, Table 9-3.

Please confirm if this is not correct.

Thank You

Venkat Anampudi

Starting from Spark version 2.1.0, "snappy" is the default compression codec; before that version, "gzip" was the default compression format in Spark.

ZhenZeng
New Contributor II

spark.sql("set spark.sql.parquet.compression.codec=gzip");
