Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How can I change the parquet compression algorithm from gzip to something else?

User16301467532
New Contributor II

By default, Spark stores Parquet files with gzip compression. I would like to change the compression algorithm from gzip to snappy or lz4.

9 REPLIES

User16301467532
New Contributor II

You can set the Spark SQL property spark.sql.parquet.compression.codec.

In SQL:

%sql set spark.sql.parquet.compression.codec=snappy

You can also set it on the sqlContext directly:

sqlContext.setConf("spark.sql.parquet.compression.codec.", "snappy")

JohnCavanaugh
New Contributor II

Note that the above has a slight typo (a trailing period in the property name).

You can also set it on the sqlContext directly: sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

Unfortunately, it appears that lz4 isn't supported as a Parquet compression codec. I'm not sure why, since lz4 is supported for io.codec.
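Putting the settings above together, here is a minimal PySpark sketch. The helper name `write_with_codec` and the `SUPPORTED_CODECS` set are my own additions based on this thread (older Spark versions), not part of the Spark API:

```python
# Sketch based on this thread: set the Parquet codec per write, with a
# validation step. SUPPORTED_CODECS reflects the codecs mentioned in this
# thread for older Spark versions; write_with_codec is a hypothetical helper.
SUPPORTED_CODECS = {"uncompressed", "snappy", "gzip", "lzo"}

def write_with_codec(df, path, codec="snappy"):
    """Write `df` to `path` as Parquet, validating the codec first."""
    if codec not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported parquet codec: {codec}")
    # Per-write option; alternatively set the session-wide default with
    # spark.conf.set("spark.sql.parquet.compression.codec", codec)
    df.write.option("compression", codec).mode("overwrite").parquet(path)
```

Validation happens before the DataFrame is touched, so an unsupported codec (such as lz4, per the note above) fails fast with a clear error.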

karthik_thati
New Contributor II

What are the options if I don't need any compression while writing my DataFrame to HDFS in Parquet format?

sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")

girivaratharaja
New Contributor III

@karthik.thati - Try this:

df.write.option("compression","none").mode("overwrite").save("testoutput.parquet")

sujoyDutta
New Contributor II

For uncompressed, use

sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")

The codec value can be one of four: uncompressed, snappy, gzip, lzo.

venkat_anampudi
New Contributor II

@prakash573:

I believe Spark uses "snappy" compression for Parquet files by default. I'm referring to the book "Learning Spark", Chapter 9, page 182, Table 9-3.

Please confirm if this is not correct.

Thank you,

Venkat Anampudi

Starting from Spark version 2.1.0, "snappy" is the default compression codec; before that version, "gzip" was the default compression format in Spark.

ZhenZeng
New Contributor II

spark.sql("set spark.sql.parquet.compression.codec=gzip");
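One way to confirm which codec actually took effect: Spark embeds the codec in its part-file names (for example `.snappy.parquet` or `.gz.parquet`; uncompressed output has no codec suffix). A small pure-Python helper, sketched below with a hypothetical function name, can inspect the output directory:

```python
import os

def detect_parquet_codec(output_dir):
    """Guess the codec from Spark part-file suffixes in `output_dir`."""
    for name in sorted(os.listdir(output_dir)):
        if not name.startswith("part-"):
            continue  # skip _SUCCESS and other marker files
        if name.endswith(".snappy.parquet"):
            return "snappy"
        if name.endswith(".gz.parquet"):
            return "gzip"
        if name.endswith(".parquet"):
            return "uncompressed"  # no codec suffix in the file name
    return None
```

For example, a directory written with the gzip setting above would contain files like `part-00000-<uuid>.gz.parquet`, and the helper would return "gzip".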
