<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Writing part files in single text file in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/writing-part-files-in-single-text-file/m-p/81564#M36344</link>
    <description>&lt;P&gt;&lt;SPAN&gt;When you write a PySpark DataFrame to a path, Spark always writes part files inside an output directory by default. This is because the output is written per partition, even if there is only one partition.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;To write a single file, you can convert the PySpark DataFrame to a pandas DataFrame and then write it to the target, like so:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;df.toPandas().to_csv(file_path, header=True, index=False)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Be careful with very large datasets: when you convert to pandas, all the data from all nodes is brought to the driver so that it can be written as a single output. If you hit out-of-memory (OOM) errors, you can try increasing the size of the driver node.&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 02 Aug 2024 02:27:56 GMT</pubDate>
    <dc:creator>Edthehead</dc:creator>
    <dc:date>2024-08-02T02:27:56Z</dc:date>
    <item>
      <title>Writing part files in single text file</title>
      <link>https://community.databricks.com/t5/data-engineering/writing-part-files-in-single-text-file/m-p/81500#M36332</link>
      <description>&lt;P&gt;I want to write all my part files into a single text file. Is there anything I can do?&lt;/P&gt;</description>
      <pubDate>Thu, 01 Aug 2024 13:28:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/writing-part-files-in-single-text-file/m-p/81500#M36332</guid>
      <dc:creator>Manthansingh</dc:creator>
      <dc:date>2024-08-01T13:28:42Z</dc:date>
    </item>
    <item>
      <title>Re: Writing part files in single text file</title>
      <link>https://community.databricks.com/t5/data-engineering/writing-part-files-in-single-text-file/m-p/81501#M36333</link>
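      <description>&lt;P&gt;Coalescing to a single partition might be your friend:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Note: Spark creates a directory named 'one-file.csv' that contains
# a single part-*.csv file (plus marker files such as _SUCCESS).
(
    df
    .coalesce(1)  # merge all partitions into one, so only one part file is written
    .write.format('csv')
    .option('header', 'true')
    .save('one-file.csv')
)&lt;/LI-CODE&gt;</description>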
      <pubDate>Thu, 01 Aug 2024 13:49:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/writing-part-files-in-single-text-file/m-p/81501#M36333</guid>
      <dc:creator>Witold</dc:creator>
      <dc:date>2024-08-01T13:49:27Z</dc:date>
    </item>
    <item>
      <title>Re: Writing part files in single text file</title>
      <link>https://community.databricks.com/t5/data-engineering/writing-part-files-in-single-text-file/m-p/81564#M36344</link>
      <description>&lt;P&gt;&lt;SPAN&gt;When you write a PySpark DataFrame to a path, Spark always writes part files inside an output directory by default. This is because the output is written per partition, even if there is only one partition.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;To write a single file, you can convert the PySpark DataFrame to a pandas DataFrame and then write it to the target, like so:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;df.toPandas().to_csv(file_path, header=True, index=False)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Be careful with very large datasets: when you convert to pandas, all the data from all nodes is brought to the driver so that it can be written as a single output. If you hit out-of-memory (OOM) errors, you can try increasing the size of the driver node.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 02 Aug 2024 02:27:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/writing-part-files-in-single-text-file/m-p/81564#M36344</guid>
      <dc:creator>Edthehead</dc:creator>
      <dc:date>2024-08-02T02:27:56Z</dc:date>
    </item>
  </channel>
</rss>

