How to set the size of Parquet output files?
05-19-2015 02:57 AM
Hi,
I'm using the Parquet format to store raw data; the part files are stored on S3.
I would like to control the file size of each Parquet part file.
I tried this:
sqlContext.setConf("spark.parquet.block.size", SIZE.toString)
sqlContext.setConf("spark.dfs.blocksize", SIZE.toString)
But it does not seem to work. Could you help me?
05-20-2015 11:36 AM
Any information?
06-30-2015 07:02 AM
Try this (in Spark 1.4.0):
val blockSize = 1024 * 1024 * 16 // 16MB
sc.hadoopConfiguration.setInt( "dfs.blocksize", blockSize )
sc.hadoopConfiguration.setInt( "parquet.block.size", blockSize )
Where sc is your SparkContext (not SQLContext).
Note that there also appear to be "page size" and "dictionary page size" parameters that interact with the block size; e.g., the page size should not exceed the block size. I set them all to the exact same value, and that got me through.
It looks like Spark will allocate 1 block in memory for every Parquet partition you output, so if you are creating a large number of Parquet partitions you could quickly hit OutOfMemory errors.
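For reference, here is a minimal sketch of that "same value everywhere" approach, assuming the standard parquet-mr configuration keys parquet.page.size and parquet.dictionary.page.size for the two extra parameters (the 16 MB target is only illustrative):
val targetSize = 1024 * 1024 * 16 // illustrative 16 MB target
sc.hadoopConfiguration.setInt("dfs.blocksize", targetSize)                 // HDFS block size, mirrored as in the snippet above
sc.hadoopConfiguration.setInt("parquet.block.size", targetSize)            // Parquet row group size
sc.hadoopConfiguration.setInt("parquet.page.size", targetSize)             // page size kept <= row group size
sc.hadoopConfiguration.setInt("parquet.dictionary.page.size", targetSize)  // dictionary page size kept <= row group size
In practice the page sizes are usually left much smaller than the row group size; setting them all equal is just the workaround described above.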
01-04-2017 09:58 PM
Hi all, can anyone tell me what the default row group size is when writing via Spark SQL?
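A quick way to check whether a row group size has been set explicitly, assuming the parquet.block.size key from the reply above (if nothing is set, parquet-mr falls back to its own built-in default, 128 MB in recent versions):
// prints the explicitly configured row group size, or a note that the parquet-mr default applies
val configured = Option(sc.hadoopConfiguration.get("parquet.block.size"))
println(configured.getOrElse("not set explicitly; parquet-mr built-in default applies"))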

