Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to set the size of Parquet output files?

richard1_558848
New Contributor II

Hi

I'm using the Parquet format to store raw data. The part files are stored on S3.

I would like to control the file size of each Parquet part file.

I tried this:

sqlContext.setConf("spark.parquet.block.size", SIZE.toString)

sqlContext.setConf("spark.dfs.blocksize", SIZE.toString)

But it doesn't seem to work. Could you help me?
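
For context, the write itself is roughly like this (Spark 1.4 syntax; the source DataFrame and the S3 paths are simplified placeholders):

val rawData = sqlContext.read.json("s3://my-bucket/raw/input/")   // placeholder source
rawData.write.parquet("s3://my-bucket/raw/parquet/")              // produces the part files on S3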

3 REPLIES

richard1_558848
New Contributor II

Any information?

__rake
New Contributor II

Try this (in 1.4.0):

val blockSize = 1024 * 1024 * 16   // 16 MB
sc.hadoopConfiguration.setInt("dfs.blocksize", blockSize)
sc.hadoopConfiguration.setInt("parquet.block.size", blockSize)

Where sc is your SparkContext (not SQLContext).

Note that there also appear to be "page size" and "dictionary page size" parameters that interact with the block size; e.g., the page size should not exceed the block size. I set them all to the exact same value, and that got me through.

It looks like Spark will allocate 1 block in memory for every Parquet partition you output, so if you are creating a large number of Parquet partitions you could quickly hit OutOfMemory errors.
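
Putting it together, a rough sketch of the whole setup (Spark 1.4, Scala; the DataFrame and S3 path are placeholders):

val sizeInBytes = 1024 * 1024 * 16   // 16 MB, used for block, page, and dictionary page size
sc.hadoopConfiguration.setInt("dfs.blocksize", sizeInBytes)
sc.hadoopConfiguration.setInt("parquet.block.size", sizeInBytes)
sc.hadoopConfiguration.setInt("parquet.page.size", sizeInBytes)
sc.hadoopConfiguration.setInt("parquet.dictionary.page.size", sizeInBytes)

// Configure before the write; each Parquet row group is then capped at roughly 16 MB.
df.write.parquet("s3://my-bucket/raw/parquet/")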

manjeet_chandho
New Contributor II

Hi all, can anyone tell me what the default row group size is when writing via Spark SQL?
