How can I change the parquet compression algorithm from gzip to something else?
07-15-2015 11:45 AM
Spark uses gzip by default when writing Parquet files. I would like to change the compression codec from gzip to snappy or lz4.
- Labels: Compression, SQL
07-15-2015 11:46 AM
You can set the following Spark SQL property: spark.sql.parquet.compression.codec.
In sql:
%sql set spark.sql.parquet.compression.codec=snappy
You can also set in the sqlContext directly:
sqlContext.setConf("spark.sql.parquet.compression.codec.", "snappy")
05-06-2016 11:06 PM
Unfortunately, it appears that lz4 isn't supported as a Parquet compression codec. I'm not sure why, since lz4 is supported for io.compression.codec.
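Side note, based on my reading of the newer documentation and not verified here: later Spark releases broaden the accepted values for this property (adding codecs such as lz4, brotli and zstd), so on a recent cluster something like the following may work; the output path is made up:
spark.conf.set("spark.sql.parquet.compression.codec", "lz4")  // only if your Spark/Parquet build supports lz4
spark.range(1000).toDF("id").write.mode("overwrite").parquet("/tmp/output_lz4")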
07-28-2016 02:01 PM
What are the options if I don't want any compression when writing my DataFrame to HDFS in Parquet format?
06-09-2017 09:26 AM
sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
07-28-2016 03:34 PM
@karthik.thati - Try this
df.write.option("compression","none").mode("overwrite").save("testoutput.parquet")
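As far as I can tell, the per-write "compression" option takes precedence over the session-level setting, so you can keep snappy as the session default and still write one uncompressed dataset; a rough sketch with hypothetical paths:
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")  // session default stays snappy
df.write
  .option("compression", "none")  // this write alone is uncompressed
  .mode("overwrite")
  .parquet("hdfs:///tmp/testoutput_uncompressed.parquet")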
06-09-2017 09:44 AM
For uncompressed output, use:
sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
The value can be one of the following four: uncompressed, snappy, gzip, lzo.
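If you want to see what each codec does to your own data, a quick comparison sketch like this writes the same DataFrame once per codec so you can compare sizes on disk; the output directory is hypothetical, and lzo requires the native LZO libraries on the cluster:
val codecs = Seq("uncompressed", "snappy", "gzip", "lzo")
for (codec <- codecs) {
  df.write
    .option("compression", codec)
    .mode("overwrite")
    .parquet(s"/tmp/parquet_codec_test/$codec")
}
// then compare directory sizes, e.g.: hdfs dfs -du -h /tmp/parquet_codec_test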
12-31-2017 06:31 AM
@prakash573: I believe Spark uses "snappy" compression for Parquet files by default. I'm referring to the "Learning Spark" book, Chapter 9, page 182, Table 9-3.
Please confirm if this is not correct.
Thank you,
Venkat Anampudi
01-16-2020 02:47 AM
Starting from Spark 2.1.0, "snappy" is the default compression codec; before that version, "gzip" was the default Parquet compression format in Spark.
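Rather than relying on memory of the defaults, you can ask the running session; in Spark 2.x something along these lines should print the codec currently in effect:
// Returns the explicit setting if one was made, otherwise the version's built-in default
println(spark.conf.get("spark.sql.parquet.compression.codec"))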
10-01-2019 02:10 AM
spark.sql("set spark.sql.parquet.compression.codec=gzip");

