<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Configuring average parquet file size in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/configuring-average-parquet-file-size/m-p/3635#M616</link>
    <description>&lt;P&gt;I see.&lt;/P&gt;&lt;P&gt;That is not an easy one.&lt;/P&gt;&lt;P&gt;It depends on how big your data is, whether it will be partitioned, etc.&lt;/P&gt;&lt;P&gt;If you use partitioning, the cardinality of the partition column largely determines the size of the output files.&lt;/P&gt;&lt;P&gt;If you have very little data, a single file might be enough.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;So the best way is to first explore your data and get to know its profile.&lt;/P&gt;&lt;P&gt;With coalesce and repartition you can define the number of partitions (files) that will be written. This has a cost, of course: repartition adds an extra shuffle, and coalesce avoids the shuffle but can reduce the parallelism of the upstream stage.&lt;/P&gt;&lt;P&gt;But be aware that there is no single optimal file size.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Delta Lake is an option, which has some automated file optimizations. Definitely worth trying out.&lt;/P&gt;</description>
    <pubDate>Thu, 08 Jun 2023 07:28:06 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2023-06-08T07:28:06Z</dc:date>
    <item>
      <title>Configuring average parquet file size</title>
      <link>https://community.databricks.com/t5/data-engineering/configuring-average-parquet-file-size/m-p/3632#M613</link>
      <description>&lt;P&gt;I have S3 as a data source containing a sample TPC dataset (10G, 100G).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I want to convert that into parquet files with an average size of about 256 MiB. What configuration parameter can I use to set that?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I also need the data to be partitioned. And within each partition directory, the files should be split based on the average size.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If I set the coalesce option `df.coalesce(1)`, then it only creates 1 file in each partition directory.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I've also tried setting these parameters (based on a Google search):&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;df.write.option("maxRecordsPerFile", 6000000)&lt;/P&gt;&lt;P&gt;df.write.option("parquet.block.size", 256 * 1024 * 1024)&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;But that didn't help either. Any suggestions?&lt;/P&gt;</description>
      <pubDate>Mon, 05 Jun 2023 06:51:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/configuring-average-parquet-file-size/m-p/3632#M613</guid>
      <dc:creator>f2008700</dc:creator>
      <dc:date>2023-06-05T06:51:00Z</dc:date>
    </item>
    <item>
      <title>Re: Configuring average parquet file size</title>
      <link>https://community.databricks.com/t5/data-engineering/configuring-average-parquet-file-size/m-p/3633#M614</link>
      <description>&lt;P&gt;Hi, you might want to check &lt;A href="https://stackoverflow.com/questions/62648621/spark-sql-files-maxpartitionbytes-not-limiting-max-size-of-written-partitions" alt="https://stackoverflow.com/questions/62648621/spark-sql-files-maxpartitionbytes-not-limiting-max-size-of-written-partitions" target="_blank"&gt;this topic&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;spark.sql.files.maxPartitionBytes sets the maximum size of the partitions Spark creates when reading files, so it influences the size of the written files, but it is not a hard constraint on them (see the SO topic).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 07 Jun 2023 08:32:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/configuring-average-parquet-file-size/m-p/3633#M614</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-06-07T08:32:01Z</dc:date>
    </item>
    <item>
      <title>Re: Configuring average parquet file size</title>
      <link>https://community.databricks.com/t5/data-engineering/configuring-average-parquet-file-size/m-p/3634#M615</link>
      <description>&lt;P&gt;Thanks for pointing it out.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My ask is actually the opposite. I want to ensure that the parquet files are not too small: each one should be larger than a certain minimum size or contain a minimum number of records.&lt;/P&gt;</description>
      <pubDate>Wed, 07 Jun 2023 19:19:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/configuring-average-parquet-file-size/m-p/3634#M615</guid>
      <dc:creator>f2008700</dc:creator>
      <dc:date>2023-06-07T19:19:11Z</dc:date>
    </item>
    <item>
      <title>Re: Configuring average parquet file size</title>
      <link>https://community.databricks.com/t5/data-engineering/configuring-average-parquet-file-size/m-p/3635#M616</link>
      <description>&lt;P&gt;I see.&lt;/P&gt;&lt;P&gt;That is not an easy one.&lt;/P&gt;&lt;P&gt;It depends on how big your data is, whether it will be partitioned, etc.&lt;/P&gt;&lt;P&gt;If you use partitioning, the cardinality of the partition column largely determines the size of the output files.&lt;/P&gt;&lt;P&gt;If you have very little data, a single file might be enough.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;So the best way is to first explore your data and get to know its profile.&lt;/P&gt;&lt;P&gt;With coalesce and repartition you can define the number of partitions (files) that will be written. This has a cost, of course: repartition adds an extra shuffle, and coalesce avoids the shuffle but can reduce the parallelism of the upstream stage.&lt;/P&gt;&lt;P&gt;But be aware that there is no single optimal file size.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Delta Lake is an option, which has some automated file optimizations. Definitely worth trying out.&lt;/P&gt;</description>
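      <content:encoded><![CDATA[
The file count passed to repartition can be derived from an estimate of the total output size. What follows is a minimal sketch under stated assumptions, not a definitive recipe: `files_for_target` and `write_with_target_size` are hypothetical helper names, `total_bytes` is an estimate you supply yourself (for example from a sample write), and rows are assumed to be spread evenly across partitions.

```python
TARGET_FILE_BYTES = 256 * 1024 * 1024  # ~256 MiB, the size asked for in this thread


def files_for_target(total_bytes: int, target_bytes: int = TARGET_FILE_BYTES) -> int:
    """Return a file count such that each file averages at least target_bytes."""
    if total_bytes <= 0 or target_bytes <= 0:
        raise ValueError("total_bytes and target_bytes must be positive")
    # Integer division rounds down, so files end up at or above the target size.
    return max(1, total_bytes // target_bytes)


def write_with_target_size(df, path: str, total_bytes: int) -> None:
    """Hypothetical helper: repartition so each partition is written as one file."""
    num_files = files_for_target(total_bytes)
    # repartition triggers one shuffle; each resulting partition becomes one file.
    df.repartition(num_files).write.parquet(path)
```

For example, `files_for_target(10 * 1024**3)` gives 40 files for a 10 GiB output. Note that the estimate should be the compressed Parquet size, which is usually smaller than the size of the source data.
]]></content:encoded>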
      <pubDate>Thu, 08 Jun 2023 07:28:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/configuring-average-parquet-file-size/m-p/3635#M616</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-06-08T07:28:06Z</dc:date>
    </item>
    <item>
      <title>Re: Configuring average parquet file size</title>
      <link>https://community.databricks.com/t5/data-engineering/configuring-average-parquet-file-size/m-p/3636#M617</link>
      <description>&lt;P&gt;Hi @Vikas Goel&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We haven't heard from you since the last response from @Werner Stinckens, and I was checking back to see if their suggestions helped you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Also, please don't forget to click on the "Select As Best" button whenever the information provided helps resolve your question.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 10 Jun 2023 02:37:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/configuring-average-parquet-file-size/m-p/3636#M617</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-06-10T02:37:16Z</dc:date>
    </item>
    <item>
      <title>Re: Configuring average parquet file size</title>
      <link>https://community.databricks.com/t5/data-engineering/configuring-average-parquet-file-size/m-p/3637#M618</link>
      <description>&lt;P&gt;@Vikas Goel&amp;nbsp;:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Adding more pointers here:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Spark has no write option that enforces a minimum file size or a minimum number of records per file; maxRecordsPerFile is only an upper bound. The size of the output files follows from the number of partitions at write time, so you can derive that number from an estimate of the total output size and repartition before writing. Here's how you can modify your code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# Target size of each Parquet file (~256 MiB)
target_file_bytes = 256 * 1024 * 1024

# Estimated total size of the Parquet output in bytes.
# Placeholder value: measure it from a sample write or a previous run.
estimated_output_bytes = 10 * 1024 ** 3

# Number of files so that each one is at least roughly the target size
num_files = max(1, estimated_output_bytes // target_file_bytes)

# Each partition is written as one file, so repartition to the desired file count
df.repartition(num_files).write.parquet("s3://your_bucket/parquet_output_path")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
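      <content:encoded><![CDATA[
For the partitioned output asked about in this thread, one common approach is to repartition by the partition column plus a random salt, so that each partition value is split across a few tasks (and therefore a few files). The sketch below is an illustration under assumptions, not a definitive implementation: the helper names and the `_salt` column are hypothetical, the data is assumed to be spread evenly across partition values, and because of hash collisions and shuffle settings Spark may write fewer files per value than requested.

```python
TARGET_FILE_BYTES = 256 * 1024 * 1024  # ~256 MiB


def files_per_partition_value(total_bytes: int, distinct_values: int,
                              target_bytes: int = TARGET_FILE_BYTES) -> int:
    """Files to aim for inside each partition directory, assuming even distribution."""
    if total_bytes <= 0 or distinct_values <= 0 or target_bytes <= 0:
        raise ValueError("all arguments must be positive")
    # Rounds down so files are not smaller than the target on average.
    return max(1, total_bytes // (distinct_values * target_bytes))


def write_partitioned(df, path: str, partition_col: str, files_per_value: int) -> None:
    """Hypothetical helper: salt rows so each partition value spans several tasks."""
    from pyspark.sql import functions as F  # imported lazily; requires pyspark

    # Random integer in [0, files_per_value) assigned to every row.
    salted = df.withColumn("_salt", (F.rand() * files_per_value).cast("int"))
    (salted.repartition(partition_col, "_salt")  # shuffle by (value, salt)
           .drop("_salt")                        # salt is not part of the output
           .write.partitionBy(partition_col)
           .parquet(path))
```

With `files_per_value` set to 1 this reduces to `df.repartition(partition_col)`, which yields at most one file per partition directory without forcing all data through a single task the way `df.coalesce(1)` does.
]]></content:encoded>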
      <pubDate>Tue, 13 Jun 2023 14:38:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/configuring-average-parquet-file-size/m-p/3637#M618</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-06-13T14:38:21Z</dc:date>
    </item>
    <item>
      <title>Re: Configuring average parquet file size</title>
      <link>https://community.databricks.com/t5/data-engineering/configuring-average-parquet-file-size/m-p/3638#M619</link>
      <description>&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Will give it a try.&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jun 2023 21:31:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/configuring-average-parquet-file-size/m-p/3638#M619</guid>
      <dc:creator>f2008700</dc:creator>
      <dc:date>2023-06-13T21:31:16Z</dc:date>
    </item>
  </channel>
</rss>

