<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Parquet file merging or other optimisation tips in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/parquet-file-merging-or-other-optimisation-tips/m-p/30321#M21964</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;I need some guidelines for a performance issue with Parquet files:&lt;/P&gt;
&lt;P&gt;I am loading a set of Parquet files using: df = sqlContext.parquetFile( folder_path )&lt;/P&gt;
&lt;P&gt;My Parquet folder has 6 subdivision keys.&lt;/P&gt;
&lt;P&gt;It was initially OK with a first sample of data organized this way, so I started pushing more, and performance is slowing down very quickly as I do so.&lt;/P&gt;
&lt;P&gt;Because of the way data arrives every day, the above folder partitioning is "natural", BUT it leads to small files, which I have read can explain the bottleneck.&lt;/P&gt;
&lt;P&gt;Shall I merge several of the subfolders in a second phase? If so, what function (Python API) shall I use for this?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 22 Jul 2015 20:15:47 GMT</pubDate>
    <dc:creator>xxMathieuxxZara</dc:creator>
    <dc:date>2015-07-22T20:15:47Z</dc:date>
    <item>
      <title>Parquet file merging or other optimisation tips</title>
      <link>https://community.databricks.com/t5/data-engineering/parquet-file-merging-or-other-optimisation-tips/m-p/30321#M21964</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;I need some guidelines for a performance issue with Parquet files:&lt;/P&gt;
&lt;P&gt;I am loading a set of Parquet files using: df = sqlContext.parquetFile( folder_path )&lt;/P&gt;
&lt;P&gt;My Parquet folder has 6 subdivision keys.&lt;/P&gt;
&lt;P&gt;It was initially OK with a first sample of data organized this way, so I started pushing more, and performance is slowing down very quickly as I do so.&lt;/P&gt;
&lt;P&gt;Because of the way data arrives every day, the above folder partitioning is "natural", BUT it leads to small files, which I have read can explain the bottleneck.&lt;/P&gt;
&lt;P&gt;Shall I merge several of the subfolders in a second phase? If so, what function (Python API) shall I use for this?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 22 Jul 2015 20:15:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parquet-file-merging-or-other-optimisation-tips/m-p/30321#M21964</guid>
      <dc:creator>xxMathieuxxZara</dc:creator>
      <dc:date>2015-07-22T20:15:47Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet file merging or other optimisation tips</title>
      <link>https://community.databricks.com/t5/data-engineering/parquet-file-merging-or-other-optimisation-tips/m-p/30322#M21965</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi Mzaradzki -&lt;/P&gt;
&lt;P&gt;In Spark 1.5 we will be adding a feature that improves metadata caching for Parquet specifically, so it should greatly improve performance for your use case above.&lt;/P&gt;
&lt;P&gt;One option to improve performance in Databricks is to use the dbutils.fs.cacheFiles function to move your parquet files to the SSDs attached to the workers in your cluster. &lt;/P&gt;
&lt;P&gt;Cheers,&lt;/P&gt;
&lt;P&gt;Richard&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 24 Jul 2015 17:17:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parquet-file-merging-or-other-optimisation-tips/m-p/30322#M21965</guid>
      <dc:creator>rlgarris</dc:creator>
      <dc:date>2015-07-24T17:17:56Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet file merging or other optimisation tips</title>
      <link>https://community.databricks.com/t5/data-engineering/parquet-file-merging-or-other-optimisation-tips/m-p/30323#M21966</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;There are a couple of SQL optimizations I recommend for you to consider.&lt;/P&gt;
&lt;P&gt;1) Making use of partitions for your table may help if you frequently access data from only certain days at a time. There's a notebook in the Databricks Guide called "Partitioned Tables" with more details.&lt;/P&gt;
&lt;P&gt;2) If your files are really small, it is true that you may get better performance by consolidating them into a smaller number of files. You can do that easily in Spark with a command like this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;(sqlContext.parquetFile( SOME_INPUT_FILEPATTERN )
          .coalesce(SOME_SMALLER_NUMBER_OF_DESIRED_PARTITIONS)
          .write.parquet(SOME_OUTPUT_DIRECTORY))&lt;/CODE&gt;&lt;/PRE&gt;
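&lt;P&gt;A rough way to pick SOME_SMALLER_NUMBER_OF_DESIRED_PARTITIONS is to divide the total input size by a target file size. The sketch below is plain Python and only illustrates the arithmetic: the helper name num_output_files and the 64MB default are assumptions for illustration, not Spark settings, and the size on disk after rewriting may differ because of compression.&lt;/P&gt;

```python
import math

# Assumed target size per output file (64MB); tune this for your workload.
TARGET_FILE_BYTES = 64 * 1024 * 1024


def num_output_files(total_bytes, target_bytes=TARGET_FILE_BYTES):
    """Return a partition count so that each output file is about target_bytes."""
    # Round up so no file has to exceed the target; always write at least one file.
    return max(1, int(math.ceil(total_bytes / float(target_bytes))))


# Example: 10GB of small input files.
print(num_output_files(10 * 1024 ** 3))  # prints 160
```

&lt;P&gt;The result can then be passed to coalesce in the command above.&lt;/P&gt;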
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 24 Jul 2015 17:22:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parquet-file-merging-or-other-optimisation-tips/m-p/30323#M21966</guid>
      <dc:creator>vida</dc:creator>
      <dc:date>2015-07-24T17:22:55Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet file merging or other optimisation tips</title>
      <link>https://community.databricks.com/t5/data-engineering/parquet-file-merging-or-other-optimisation-tips/m-p/30324#M21967</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Having a large number of small files or folders can significantly degrade the performance of loading the data. The best approach is to keep the folders/files merged so that each file is around 64MB in size. There are different ways to achieve this: your writer process can either buffer records in memory and write only after reaching a certain size, or, as a second phase, you can read the temp directory, consolidate the files, and write them out to a different location.&lt;/P&gt;
&lt;P&gt;If you want to do the latter, you can read each of your input directories as a DataFrame, union them, coalesce the result to the number of files you want, and write it back out. A code snippet in Scala would be:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import scala.collection.mutable.MutableList
import org.apache.spark.sql.DataFrame

// Collect one DataFrame per source directory.
val dfSeq = MutableList[DataFrame]()
sourceDirsToConsolidate.foreach(dir =&amp;gt; {
  val df = sqlContext.parquetFile(dir)
  dfSeq += df
})

// Union everything, then write it out as at most numOutputFiles files.
val masterDf = dfSeq.reduce((df1, df2) =&amp;gt; df1.unionAll(df2))
masterDf.coalesce(numOutputFiles).write.mode(saveMode).parquet(destDir)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The DataFrame API is the same in Python, so you should be able to convert this to Python easily.&lt;/P&gt;
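&lt;P&gt;A possible Python translation of the Scala snippet above, as a sketch only: the function name consolidate_parquet and its parameters are hypothetical, and it assumes Spark 1.4 or later, where sqlContext.read.parquet and DataFrame.write are available. It also assumes that every input directory has the same schema and that at least one directory is given.&lt;/P&gt;

```python
from functools import reduce


def consolidate_parquet(sql_context, source_dirs, dest_dir,
                        num_output_files, save_mode="error"):
    """Read each directory, union the DataFrames, and rewrite as fewer files."""
    # One DataFrame per input directory.
    dfs = [sql_context.read.parquet(d) for d in source_dirs]
    # unionAll appends rows; it matches columns by position, not by name.
    master_df = reduce(lambda left, right: left.unionAll(right), dfs)
    # coalesce caps the partition count, so at most that many files are written.
    (master_df.coalesce(num_output_files)
              .write.mode(save_mode)
              .parquet(dest_dir))
```

&lt;P&gt;In a notebook this could be called as consolidate_parquet(sqlContext, ["/data/day1", "/data/day2"], "/data/merged", 8), where the paths and the file count are placeholders.&lt;/P&gt;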
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 24 Jul 2015 17:28:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parquet-file-merging-or-other-optimisation-tips/m-p/30324#M21967</guid>
      <dc:creator>User16301467532</dc:creator>
      <dc:date>2015-07-24T17:28:19Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet file merging or other optimisation tips</title>
      <link>https://community.databricks.com/t5/data-engineering/parquet-file-merging-or-other-optimisation-tips/m-p/30325#M21968</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi Richard, &lt;/P&gt;
&lt;P&gt;Will this actually parallelize reading the footers? Or will it just help for Spark-generated Parquet files? With respect to the serialized footer reading, I haven't noticed large gains from caching the files on the SSDs.&lt;/P&gt;
&lt;P&gt;Cheers,&lt;/P&gt;
&lt;P&gt;Ken&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 24 Jul 2015 20:37:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parquet-file-merging-or-other-optimisation-tips/m-p/30325#M21968</guid>
      <dc:creator>KennethYocum</dc:creator>
      <dc:date>2015-07-24T20:37:17Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet file merging or other optimisation tips</title>
      <link>https://community.databricks.com/t5/data-engineering/parquet-file-merging-or-other-optimisation-tips/m-p/30326#M21969</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi Prakash,&lt;/P&gt;
&lt;P&gt;I am trying to transfer Parquet files from on-prem Hadoop to S3. I am able to move normal HDFS files, but when it comes to Parquet it is not working properly.&lt;/P&gt;
&lt;P&gt;Do you have any clue how to transfer Parquet files from HDFS to S3?&lt;/P&gt;
&lt;P&gt;I appreciate your response.&lt;/P&gt;
&lt;P&gt;Thanks&lt;/P&gt;
&lt;P&gt;Ishan &lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 16 Jul 2017 19:22:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parquet-file-merging-or-other-optimisation-tips/m-p/30326#M21969</guid>
      <dc:creator>ishangaur</dc:creator>
      <dc:date>2017-07-16T19:22:51Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet file merging or other optimisation tips</title>
      <link>https://community.databricks.com/t5/data-engineering/parquet-file-merging-or-other-optimisation-tips/m-p/30327#M21970</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I have multiple small Parquet files in every partition. This is legacy data, and I want to merge the files in each individual partition directory into a single file. How can we achieve this?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 27 Aug 2019 06:25:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parquet-file-merging-or-other-optimisation-tips/m-p/30327#M21970</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2019-08-27T06:25:46Z</dc:date>
    </item>
  </channel>
</rss>

