<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How do I create a single CSV file from multiple partitions in Databricks / Spark? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29974#M21661</link>
<description>&lt;P&gt;Is FileUtils.copyMerge() supported in Databricks on DBFS?&lt;/P&gt;</description>
    <pubDate>Fri, 25 Dec 2020 04:59:37 GMT</pubDate>
    <dc:creator>Rampatel5</dc:creator>
    <dc:date>2020-12-25T04:59:37Z</dc:date>
    <item>
      <title>How do I create a single CSV file from multiple partitions in Databricks / Spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29962#M21649</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Using sparkcsv to write data to dbfs, which I plan to move to my laptop via standard s3 copy commands.&lt;/P&gt;
&lt;P&gt;The default for spark csv is to write output into partitions. I can force it to a single partition, but would really like to know if there is a generic way to do this.&lt;/P&gt;
&lt;P&gt;In a hadoop file system, I'd simply run something like&lt;/P&gt;
&lt;P&gt;hadoop fs -getmerge /user/hadoop/dir1/ ./myoutput.txt&lt;/P&gt;
&lt;P&gt;Any equivalent from within the databricks platform?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 02 Dec 2015 18:26:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29962#M21649</guid>
      <dc:creator>rlgarris</dc:creator>
      <dc:date>2015-12-02T18:26:01Z</dc:date>
    </item>
    <item>
      <title>Re: How do I create a single CSV file from multiple partitions in Databricks / Spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29963#M21650</link>
      <description>&lt;P&gt;If the data isn't more than a few GB, then you can coalesce your dataset to a single partition prior to writing it out.&lt;/P&gt;&lt;P&gt;Something like:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.coalesce(1).write.format("com.databricks.spark.csv").save("...path...")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;then copy the single part file to a .csv name using a dbutils.fs command:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;dbutils.fs.cp("...path...", "...path...csv")&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 02 Dec 2015 18:26:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29963#M21650</guid>
      <dc:creator>rlgarris</dc:creator>
      <dc:date>2015-12-02T18:26:21Z</dc:date>
    </item>
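The coalesce-then-copy recipe above can be sketched locally. This is a minimal Python stand-in, not the Databricks API: the helper name `collect_single_part` is illustrative, and plain file operations stand in for the `dbutils.fs.cp`/`dbutils.fs.rm` calls.

```python
# Sketch of the pattern: write output into a temporary directory, locate the
# single part file Spark produced, copy it to the final CSV name, clean up.
# Local stand-in only; collect_single_part is an illustrative name.
import shutil
import tempfile
from pathlib import Path

def collect_single_part(tmp_dir: Path, final_path: Path) -> Path:
    """Copy the lone part-*.csv out of tmp_dir to final_path, then clean up."""
    parts = list(tmp_dir.glob("part-*.csv"))
    if len(parts) != 1:
        raise RuntimeError(f"expected one part file, found {len(parts)}")
    shutil.copyfile(parts[0], final_path)
    shutil.rmtree(tmp_dir)  # analogous to dbutils.fs.rm(..., recurse=True)
    return final_path

# Simulate a coalesce(1) output directory containing one part file.
tmp = Path(tempfile.mkdtemp())
(tmp / "part-00000.csv").write_text("id,name\n1,alice\n")
out = collect_single_part(tmp, Path(tempfile.mkdtemp()) / "final.csv")
print(out.read_text())
```

The same three steps (write, locate part file, copy and remove) are what the dbutils-based answers later in this thread perform on DBFS.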
    <item>
      <title>Re: How do I create a single CSV file from multiple partitions in Databricks / Spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29964#M21651</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Thanks Richard. That is useful for single files. I'll add it to our local docs. I ended up writing a shell script that downloads all parts and merges them locally, so that can remain an option for people with larger files.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 02 Dec 2015 18:26:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29964#M21651</guid>
      <dc:creator>rlgarris</dc:creator>
      <dc:date>2015-12-02T18:26:37Z</dc:date>
    </item>
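The download-and-merge-locally approach mentioned above can be sketched in Python. This is a hedged stand-in for the shell script, not its actual contents: `merge_csv_parts` is an illustrative name, and the sample part files simulate what Spark writes with `header=true` (each part repeats the header, so the merge must drop duplicates).

```python
# Merge downloaded Spark part files into one CSV, keeping a single header.
# Illustrative sketch; assumes each part carries its own header line.
import tempfile
from pathlib import Path

def merge_csv_parts(part_paths, out_path: Path) -> Path:
    # Keep the header from the first part; skip the repeated header in the rest.
    with out_path.open("w") as out:
        for i, path in enumerate(sorted(part_paths)):
            lines = Path(path).read_text().splitlines(keepends=True)
            out.writelines(lines if i == 0 else lines[1:])
    return out_path

# Simulate two downloaded part files as Spark writes them with header=true.
d = Path(tempfile.mkdtemp())
(d / "part-00000.csv").write_text("id,name\n1,alice\n")
(d / "part-00001.csv").write_text("id,name\n2,bob\n")
merged = merge_csv_parts(list(d.glob("part-*.csv")), d / "merged.csv")
print(merged.read_text())
```

Sorting the part paths preserves Spark's partition order, since part files are numbered lexicographically.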
    <item>
      <title>Re: How do I create a single CSV file from multiple partitions in Databricks / Spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29965#M21652</link>
      <description>&lt;P&gt;Any tips if the data is more than a few GB? Obviously the concern is that a call to coalesce will bring all of the data into driver memory.&lt;/P&gt;</description>
      <pubDate>Fri, 19 Aug 2016 00:34:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29965#M21652</guid>
      <dc:creator>ChrisJohnson</dc:creator>
      <dc:date>2016-08-19T00:34:47Z</dc:date>
    </item>
    <item>
      <title>Re: How do I create a single CSV file from multiple partitions in Databricks / Spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29966#M21653</link>
      <description>&lt;P&gt;THIS IS TERRIBLE ADVICE. DO NOT USE the DataFrame methods .coalesce(1) or .repartition(1) except for very small data sets. Instead, use the HDFS merge mechanism via FileUtils.copyMerge(). This Stack Overflow answer correctly identifies how to do it:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://stackoverflow.com/a/41785085/501113" target="_blank"&gt;http://stackoverflow.com/a/41785085/501113&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 31 Mar 2017 19:18:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29966#M21653</guid>
      <dc:creator>chaotic3quilibr</dc:creator>
      <dc:date>2017-03-31T19:18:33Z</dc:date>
    </item>
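FileUtils.copyMerge() (like `hadoop fs -getmerge`) works by concatenating every file in a directory into one destination file on the filesystem layer, so no executor has to hold the whole dataset in memory. A local, stdlib-only stand-in for that concatenation behavior (the `getmerge` helper is an illustrative name, not the Hadoop API):

```python
# Local analogue of copyMerge / hadoop fs -getmerge: byte-concatenate all
# files in a directory into one output file, streaming with O(1) memory.
import shutil
import tempfile
from pathlib import Path

def getmerge(src_dir: Path, dst_file: Path) -> Path:
    with dst_file.open("wb") as dst:
        for part in sorted(src_dir.iterdir()):   # lexicographic = partition order
            with part.open("rb") as src:
                shutil.copyfileobj(src, dst)     # stream bytes, never load whole file
    return dst_file

d = Path(tempfile.mkdtemp())
(d / "part-00000").write_text("a,b\n")
(d / "part-00001").write_text("c,d\n")
merged = getmerge(d, Path(tempfile.mkdtemp()) / "merged.txt")
print(merged.read_text())
```

Two caveats worth knowing: because copyMerge is a raw concatenation, writing with `header=true` duplicates the header in every part, so the header is usually written separately; and FileUtils.copyMerge() was removed in Hadoop 3.0, so check the cluster's Hadoop version before relying on it.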
    <item>
      <title>Re: How do I create a single CSV file from multiple partitions in Databricks / Spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29967#M21654</link>
      <description>&lt;P&gt;Please see this Stack Overflow answer for the most effective way to use the HDFS FileUtils.copyMerge() command:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://stackoverflow.com/a/41785085/501113" target="_blank"&gt;http://stackoverflow.com/a/41785085/501113&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 31 Mar 2017 19:21:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29967#M21654</guid>
      <dc:creator>chaotic3quilibr</dc:creator>
      <dc:date>2017-03-31T19:21:21Z</dc:date>
    </item>
    <item>
      <title>Re: How do I create a single CSV file from multiple partitions in Databricks / Spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29968#M21655</link>
      <description>&lt;P&gt;Please see this Stack Overflow answer for the most effective way to use the HDFS FileUtils.copyMerge() command:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://stackoverflow.com/a/41785085/501113" target="_blank"&gt;http://stackoverflow.com/a/41785085/501113&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 31 Mar 2017 19:21:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29968#M21655</guid>
      <dc:creator>chaotic3quilibr</dc:creator>
      <dc:date>2017-03-31T19:21:58Z</dc:date>
    </item>
    <item>
      <title>Re: How do I create a single CSV file from multiple partitions in Databricks / Spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29969#M21656</link>
      <description>&lt;P&gt;If you can fit all the data into RAM on one worker (and thus can use .coalesce(1)), you can use dbutils.fs to find and move the resulting CSV file:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;val fileprefix = "/mnt/aws/path/file-prefix"

dataset
  .coalesce(1)
  .write
  //.mode("overwrite") // I usually don't use this, but you may want to.
  .option("header", "true")
  .option("delimiter", "\t")
  .csv(fileprefix + ".tmp")

val partition_path = dbutils.fs.ls(fileprefix + ".tmp/")
  .filter(file =&amp;gt; file.name.endsWith(".csv"))(0).path

dbutils.fs.cp(partition_path, fileprefix + ".tab")

dbutils.fs.rm(fileprefix + ".tmp", recurse = true)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;If your file does not fit into RAM on the worker, you may want to consider chaoticequilibrium's suggestion to use FileUtils.copyMerge(). I have not done this, and don't yet know if it is possible, e.g., on S3.&lt;/P&gt;
&lt;P&gt;Sources:&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;Stack Overflow: Writing single CSV file&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Thu, 27 Jul 2017 20:10:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29969#M21656</guid>
      <dc:creator>JosiahYoder</dc:creator>
      <dc:date>2017-07-27T20:10:18Z</dc:date>
    </item>
    <item>
      <title>Re: How do I create a single CSV file from multiple partitions in Databricks / Spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29970#M21657</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I'm really miffed that my formatting of the code disappears when I commit the edit.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 27 Jul 2017 20:11:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29970#M21657</guid>
      <dc:creator>JosiahYoder</dc:creator>
      <dc:date>2017-07-27T20:11:25Z</dc:date>
    </item>
    <item>
      <title>Re: How do I create a single CSV file from multiple partitions in Databricks / Spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29971#M21658</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;See my embellishment of this answer, filling out the ...s in the "...path...": &lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 27 Jul 2017 20:14:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29971#M21658</guid>
      <dc:creator>JosiahYoder</dc:creator>
      <dc:date>2017-07-27T20:14:38Z</dc:date>
    </item>
    <item>
      <title>Re: How do I create a single CSV file from multiple partitions in Databricks / Spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29972#M21659</link>
      <description>&lt;P&gt;You need to set the recursive option on the copy command. Matthew Gascoyne explained it in detail in one of his posts, covering the message you run into when trying to copy a folder from one location to another in Databricks.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Jul 2019 11:09:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29972#M21659</guid>
      <dc:creator>RandyBonnette</dc:creator>
      <dc:date>2019-07-03T11:09:13Z</dc:date>
    </item>
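For reference, the recursive setting the post refers to is the `recurse` argument of `dbutils.fs.cp` (e.g. `dbutils.fs.cp(src, dst, recurse=True)`), which copies a folder and everything under it. A local stdlib analogue of such a recursive copy:

```python
# Local analogue of a recursive folder copy (dbutils.fs.cp with recurse=True).
import shutil
import tempfile
from pathlib import Path

src = Path(tempfile.mkdtemp())
(src / "sub").mkdir()
(src / "sub" / "f.txt").write_text("hello")

dst = Path(tempfile.mkdtemp()) / "copy"   # destination must not exist yet
shutil.copytree(src, dst)                 # copies the folder and all children
print((dst / "sub" / "f.txt").read_text())
```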
    <item>
      <title>Re: How do I create a single CSV file from multiple partitions in Databricks / Spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29973#M21660</link>
      <description>&lt;P&gt;Without access to bash, it would be highly appreciated if an option existed within Databricks itself (e.g. via dbutils.fs).&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jan 2020 11:50:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29973#M21660</guid>
      <dc:creator>ChristianHomber</dc:creator>
      <dc:date>2020-01-21T11:50:40Z</dc:date>
    </item>
    <item>
      <title>Re: How do I create a single CSV file from multiple partitions in Databricks / Spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29974#M21661</link>
      <description>&lt;P&gt;Is FileUtils.copyMerge() supported in Databricks on DBFS?&lt;/P&gt;</description>
      <pubDate>Fri, 25 Dec 2020 04:59:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-create-a-single-csv-file-from-multiple-partitions-in/m-p/29974#M21661</guid>
      <dc:creator>Rampatel5</dc:creator>
      <dc:date>2020-12-25T04:59:37Z</dc:date>
    </item>
  </channel>
</rss>

