How can I change the parquet compression algorithm from gzip to something else?

User16301467532
New Contributor II

Spark, by default, uses gzip to store parquet files. I would like to change the compression algorithm from gzip to snappy or lz4.

9 REPLIES

User16301467532
New Contributor II

You can set the following Spark SQL property: spark.sql.parquet.compression.codec.

In sql:

%sql set spark.sql.parquet.compression.codec=snappy

You can also set it on the sqlContext directly:

sqlContext.setConf("spark.sql.parquet.compression.codec.", "snappy")
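On Spark 2.x and later (including Databricks runtimes), the same setting can be applied through the SparkSession instead of the sqlContext. A minimal PySpark sketch, assuming a local session and a hypothetical output path:

# Minimal PySpark sketch: set the session-wide Parquet codec, then write.
# The output path /tmp/events_parquet is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-codec-demo").getOrCreate()
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").parquet("/tmp/events_parquet")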

JohnCavanaugh
New Contributor II

Note that the above has a slight typo (a trailing period in the property name).

You can also set it on the sqlContext directly: sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

Unfortunately, it appears that lz4 isn't supported as a Parquet compression codec. I'm not sure why, since lz4 is supported for spark.io.compression.codec.
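For readers on newer Spark releases: the documentation there lists lz4 (and zstd) among the accepted values for spark.sql.parquet.compression.codec, so the limitation above applies to the older versions discussed in this thread. A sketch under that assumption:

# Sketch assuming a Spark version whose Parquet writer accepts "lz4"
# (older versions reject it with an IllegalArgumentException).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.parquet.compression.codec", "lz4")
spark.range(10).write.mode("overwrite").parquet("/tmp/lz4_demo_parquet")  # hypothetical path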

karthik_thati
New Contributor II

What are the options if I don't need any compression while writing my DataFrame to HDFS in Parquet format?

sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")

girivaratharaja
New Contributor III

@karthik.thati - Try this:

df.write.option("compression","none").mode("overwrite").save("testoutput.parquet")
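For what it's worth, the per-write option takes precedence over the session-level codec for that single write, which is handy when only one job needs different output. A small PySpark sketch with hypothetical paths:

# Sketch: the write-level "compression" option overrides the session default
# (set to snappy here) for this one write only. The path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

spark.range(100).write.option("compression", "none").mode("overwrite").parquet("/tmp/uncompressed_parquet")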

sujoyDutta
New Contributor II

For uncompressed output, use:

sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")

The value can be one of the four: uncompressed, snappy, gzip, lzo.
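A quick way to confirm which codec was actually used is to look at the part-file names: Spark appends the codec to each file (for example .snappy.parquet or .gz.parquet), while uncompressed output gets a plain .parquet suffix. A sketch, assuming local mode and a hypothetical output path:

# Sketch: write with gzip, then list the part files to check the codec suffix.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

out = "/tmp/codec_check_parquet"  # hypothetical local path
spark.range(10).write.mode("overwrite").parquet(out)

# Expect names like part-00000-<uuid>.gz.parquet for gzip,
# .snappy.parquet for snappy, and plain .parquet for uncompressed.
print([f for f in os.listdir(out) if f.startswith("part-")])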

venkat_anampudi
New Contributor II

@prakash573: I guess Spark uses "snappy" compression for Parquet files by default. I'm referring to the book "Learning Spark", Chapter 9, page 182, Table 9-3.

Please confirm if this is not correct.

Thank You

Venkat Anampudi

Starting from Spark version 2.1.0, "snappy" is the default Parquet compression codec; before that version, "gzip" was the default.
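To check which default a given cluster actually uses, the effective value can be read back from the session. A short PySpark sketch:

# Sketch: read the effective Parquet codec for the current session.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.conf.get("spark.sql.parquet.compression.codec"))  # e.g. "snappy" on Spark 2.1.0+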

ZhenZeng
New Contributor II

spark.sql("set spark.sql.parquet.compression.codec=gzip");
