<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How to set size of Parquet output files ? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-set-size-of-parquet-output-files/m-p/30478#M22103</link>
<description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;I'm using the Parquet format to store raw data; the part files are stored on S3.&lt;/P&gt;
&lt;P&gt;I would like to control the file size of each Parquet part file.&lt;/P&gt;
&lt;P&gt;I tried this:&lt;/P&gt;
&lt;P&gt;sqlContext.setConf("spark.parquet.block.size", SIZE.toString)&lt;/P&gt;
&lt;P&gt;sqlContext.setConf("spark.dfs.blocksize", SIZE.toString)&lt;/P&gt;
&lt;P&gt;But it doesn't seem to work. Could you help me?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 19 May 2015 09:57:02 GMT</pubDate>
    <dc:creator>richard1_558848</dc:creator>
    <dc:date>2015-05-19T09:57:02Z</dc:date>
    <item>
      <title>How to set size of Parquet output files ?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-set-size-of-parquet-output-files/m-p/30478#M22103</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;I'm using the Parquet format to store raw data; the part files are stored on S3.&lt;/P&gt;
&lt;P&gt;I would like to control the file size of each Parquet part file.&lt;/P&gt;
&lt;P&gt;I tried this:&lt;/P&gt;
&lt;P&gt;sqlContext.setConf("spark.parquet.block.size", SIZE.toString)&lt;/P&gt;
&lt;P&gt;sqlContext.setConf("spark.dfs.blocksize", SIZE.toString)&lt;/P&gt;
&lt;P&gt;But it doesn't seem to work. Could you help me?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 19 May 2015 09:57:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-set-size-of-parquet-output-files/m-p/30478#M22103</guid>
      <dc:creator>richard1_558848</dc:creator>
      <dc:date>2015-05-19T09:57:02Z</dc:date>
    </item>
    <item>
      <title>Re: How to set size of Parquet output files ?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-set-size-of-parquet-output-files/m-p/30479#M22104</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Any information?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 20 May 2015 18:36:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-set-size-of-parquet-output-files/m-p/30479#M22104</guid>
      <dc:creator>richard1_558848</dc:creator>
      <dc:date>2015-05-20T18:36:32Z</dc:date>
    </item>
    <item>
      <title>Re: How to set size of Parquet output files ?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-set-size-of-parquet-output-files/m-p/30480#M22105</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Try this (in Spark 1.4.0):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;val blockSize = 1024 * 1024 * 16      // 16MB
sc.hadoopConfiguration.setInt( "dfs.blocksize", blockSize )
sc.hadoopConfiguration.setInt( "parquet.block.size", blockSize )&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Where &lt;I&gt;sc&lt;/I&gt; is your SparkContext (not the SQLContext).&lt;/P&gt;
&lt;P&gt;Note that there also appear to be "page size" and "dictionary page size" parameters that interact with the block size; e.g., the page size should not exceed the block size. I set them all to the same value, and that worked for me.&lt;/P&gt;
&lt;P&gt;It looks like Spark will allocate 1 block &lt;B&gt;in memory&lt;/B&gt; for every Parquet partition you output, so if you are creating a large number of Parquet partitions you could quickly hit OutOfMemory errors.&lt;/P&gt;
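&lt;P&gt;Putting it together, a minimal sketch (assuming Spark 1.4-era APIs, an existing DataFrame &lt;I&gt;df&lt;/I&gt;, and an illustrative S3 path):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;// Set the target sizes on the Hadoop configuration *before* the write.
val blockSize = 1024 * 1024 * 16      // 16MB target per Parquet row group
sc.hadoopConfiguration.setInt( "dfs.blocksize", blockSize )
sc.hadoopConfiguration.setInt( "parquet.block.size", blockSize )

// df is assumed to be a DataFrame you already built; the path is illustrative.
df.write.parquet( "s3n://my-bucket/raw/" )&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The resulting part-file sizes will still vary with compression and the number of row groups per file; &lt;I&gt;parquet.block.size&lt;/I&gt; caps the row group size, not the file size exactly.&lt;/P&gt;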
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 30 Jun 2015 14:02:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-set-size-of-parquet-output-files/m-p/30480#M22105</guid>
      <dc:creator>__rake</dc:creator>
      <dc:date>2015-06-30T14:02:20Z</dc:date>
    </item>
    <item>
      <title>Re: How to set size of Parquet output files ?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-set-size-of-parquet-output-files/m-p/30481#M22106</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi all, can anyone tell me what the default row group size is when writing via Spark SQL?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 05 Jan 2017 05:58:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-set-size-of-parquet-output-files/m-p/30481#M22106</guid>
      <dc:creator>manjeet_chandho</dc:creator>
      <dc:date>2017-01-05T05:58:14Z</dc:date>
    </item>
  </channel>
</rss>

