<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Writing back from notebook to blob storage as single file with UC configured databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/writing-back-from-notebook-to-blob-storage-as-single-file-with/m-p/110447#M43577</link>
    <description>&lt;P&gt;Hi Shivap&lt;/P&gt;&lt;P&gt;If you want to save a DataFrame as a single file, you could convert the PySpark DataFrame to a pandas DataFrame and then save it as a file. Note that toPandas() collects all rows onto the driver, so this only suits data that fits in driver memory.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;path_single_file = '/Volumes/demo/raw/test/single'

# create sample dataframe
df = spark.createDataFrame(
    [(i, f'name_{i}', i*10, i%2 == 0, f'2025-02-{i+1:02d}') for i in range(1, 21)],
    ['id', 'name', 'value', 'is_even', 'date']
)

# convert df to pandas dataframe
pdf = df.toPandas()

# create the folder if it does not exist
import os
os.makedirs(path_single_file, exist_ok=True)

pdf.to_parquet(f'{path_single_file}/my_df.parquet', index=False)&lt;/LI-CODE&gt;&lt;P&gt;When I then look at the directory, it looks like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="StefanKoch_0-1739858098639.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14913i1BDF719E793B3FF5/image-size/medium?v=v2&amp;amp;px=400" role="button" title="StefanKoch_0-1739858098639.png" alt="StefanKoch_0-1739858098639.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 18 Feb 2025 05:59:22 GMT</pubDate>
    <dc:creator>Stefan-Koch</dc:creator>
    <dc:date>2025-02-18T05:59:22Z</dc:date>
    <item>
      <title>Writing back from notebook to blob storage as single file with UC configured databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/writing-back-from-notebook-to-blob-storage-as-single-file-with/m-p/110433#M43573</link>
      <description>&lt;P&gt;I want to write a file from a notebook to blob storage. We have Unity Catalog configured. When the write runs, it creates a folder with the file name I provided and writes multiple files inside it, as shown below. Can someone suggest how I can write it as a single file?&lt;/P&gt;&lt;P&gt;_committed_3484505682152580967&lt;BR /&gt;_started_3484505682152580967&lt;BR /&gt;_SUCCESS&lt;BR /&gt;part-00000-tid-3484505682152580967-77c7321e-c7f1-4194-b5f2-e5194aa2eb52-740-1-c000.csv&lt;/P&gt;</description>
      <pubDate>Tue, 18 Feb 2025 00:47:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/writing-back-from-notebook-to-blob-storage-as-single-file-with/m-p/110433#M43573</guid>
      <dc:creator>Shivap</dc:creator>
      <dc:date>2025-02-18T00:47:51Z</dc:date>
    </item>
    <item>
      <title>Re: Writing back from notebook to blob storage as single file with UC configured databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/writing-back-from-notebook-to-blob-storage-as-single-file-with/m-p/110434#M43574</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/139292"&gt;@Shivap&lt;/a&gt;,&lt;/P&gt;
&lt;P class="p1"&gt;By default, Spark writes a directory rather than a single file: each partition is written in parallel as its own part file, and marker files such as _committed, _started, and _SUCCESS record the state of the write. To end up with a single file, reduce the DataFrame to one partition before writing and then copy the resulting part file to the final path.&lt;/P&gt;
&lt;P class="p1"&gt;Here are the steps to ensure you write a single file:&lt;/P&gt;
&lt;OL class="ol1"&gt;
&lt;LI class="li1"&gt;&lt;STRONG&gt;Configure the Output Path&lt;/STRONG&gt;: Specify the exact file path where you want the single file to be written.&lt;/LI&gt;
&lt;LI class="li1"&gt;&lt;STRONG&gt;Use Coalesce or Repartition&lt;/STRONG&gt;: If you're working with a DataFrame, use coalesce(1) to collect all data into one partition before writing. Spark then writes a single part file, although it still lands inside an output directory next to the marker files.&lt;/LI&gt;
&lt;LI class="li1"&gt;&lt;STRONG&gt;Save the File with dbutils.fs.cp&lt;/STRONG&gt;: Write the DataFrame to a temporary path, then use dbutils.fs.cp to copy the resultant part file to the desired single file path.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P class="p1"&gt;Here is an example using these steps:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Example DataFrame
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])

# Write DataFrame to a temporary directory
temp_path = "dbfs:/tmp/output"
(df.coalesce(1)  # Reduce to one partition
    .write
    .mode('overwrite')
    .option('header', 'true')
    .csv(temp_path))

# List the files in the temporary directory
files = dbutils.fs.ls(temp_path)
part_file = [file.path for file in files if file.name.startswith("part")][0]

# Define the final output path
single_file_path = "dbfs:/path/to/final_output.csv"

# Copy the part file to the final output path
dbutils.fs.cp(part_file, single_file_path)

# Clean up temporary directory
dbutils.fs.rm(temp_path, True)&lt;/LI-CODE&gt;
&lt;P class="p1"&gt;This approach ensures that the data is written to a single file named final_output.csv in your specified location.&lt;/P&gt;
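&lt;P class="p1"&gt;Because the question involves Unity Catalog, where the output location is often a volume path such as /Volumes/demo/raw/test rather than dbfs:/, the copy-and-clean-up step can also be sketched with plain Python file APIs. This is only a minimal sketch, assuming the volume is reachable through the local file system on your cluster; promote_single_part_file is a hypothetical helper name, and the demonstration at the end uses a made-up temporary directory instead of real Spark output. Unlike indexing the list with [0], it raises an error unless exactly one part file is present.&lt;/P&gt;

```python
import shutil
import tempfile
from pathlib import Path


def promote_single_part_file(temp_dir, final_path):
    """Move the only part-* file in temp_dir to final_path, then delete temp_dir."""
    temp_dir = Path(temp_dir)
    final_path = Path(final_path)

    # Spark names data files part-*; marker files start with "_" and are skipped.
    part_files = sorted(p for p in temp_dir.iterdir() if p.name.startswith("part-"))
    if len(part_files) != 1:
        raise ValueError(f"expected exactly one part file, found {len(part_files)}")

    # Create the destination folder if needed, then move the data file into place.
    final_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(part_files[0]), str(final_path))

    # Remove the temporary directory together with the remaining marker files.
    shutil.rmtree(temp_dir)
    return final_path


# Local demonstration with a fake Spark output directory (all names are made up).
base = Path(tempfile.mkdtemp())
fake_output = base / "tmp_output"
fake_output.mkdir()
(fake_output / "_SUCCESS").write_text("")
(fake_output / "_committed_123").write_text("")
(fake_output / "part-00000-tid-123-c000.csv").write_text("id,value\n1,a\n")

result = promote_single_part_file(fake_output, base / "final" / "final_output.csv")
print(result.read_text())
```

&lt;P class="p1"&gt;On a cluster you would pass the directory that Spark wrote to and the final file path on the volume. Moving rather than copying means the temporary directory only holds marker files by the time it is deleted.&lt;/P&gt;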
&lt;P class="p1"&gt;Keep in mind that coalesce(1) funnels all the data through a single task, so the coalesce(1) approach can be slow for very large datasets. For those, consider keeping the output partitioned and having the consuming side read the multiple part files.&lt;/P&gt;</description>
      <pubDate>Tue, 18 Feb 2025 01:49:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/writing-back-from-notebook-to-blob-storage-as-single-file-with/m-p/110434#M43574</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2025-02-18T01:49:14Z</dc:date>
    </item>
    <item>
      <title>Re: Writing back from notebook to blob storage as single file with UC configured databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/writing-back-from-notebook-to-blob-storage-as-single-file-with/m-p/110447#M43577</link>
      <description>&lt;P&gt;Hi Shivap&lt;/P&gt;&lt;P&gt;If you want to save a DataFrame as a single file, you could convert the PySpark DataFrame to a pandas DataFrame and then save it as a file. Note that toPandas() collects all rows onto the driver, so this only suits data that fits in driver memory.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;path_single_file = '/Volumes/demo/raw/test/single'

# create sample dataframe
df = spark.createDataFrame(
    [(i, f'name_{i}', i*10, i%2 == 0, f'2025-02-{i+1:02d}') for i in range(1, 21)],
    ['id', 'name', 'value', 'is_even', 'date']
)

# convert df to pandas dataframe
pdf = df.toPandas()

# create the folder if it does not exist
import os
os.makedirs(path_single_file, exist_ok=True)

pdf.to_parquet(f'{path_single_file}/my_df.parquet', index=False)&lt;/LI-CODE&gt;&lt;P&gt;When I then look at the directory, it looks like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="StefanKoch_0-1739858098639.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14913i1BDF719E793B3FF5/image-size/medium?v=v2&amp;amp;px=400" role="button" title="StefanKoch_0-1739858098639.png" alt="StefanKoch_0-1739858098639.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 18 Feb 2025 05:59:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/writing-back-from-notebook-to-blob-storage-as-single-file-with/m-p/110447#M43577</guid>
      <dc:creator>Stefan-Koch</dc:creator>
      <dc:date>2025-02-18T05:59:22Z</dc:date>
    </item>
  </channel>
</rss>

