<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Partition Size: in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/partition-size/m-p/107731#M42905</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/145131"&gt;@subhas_hati&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;You read a 3.8 GB file into a DataFrame with the default partition size of 128 MB but measured an average partition size of 159 MB. One setting that influences this is the&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="c-mrkdwn__code" data-stringify-type="code"&gt;spark.sql.files.openCostInBytes&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;configuration. The two relevant settings are:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;•&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG data-stringify-type="bold"&gt;&lt;CODE class="c-mrkdwn__code" data-stringify-type="code"&gt;spark.sql.files.maxPartitionBytes&lt;/CODE&gt;&lt;/STRONG&gt;&lt;SPAN&gt;: The maximum number of bytes to pack into a single partition when reading files. The default is 128 MB.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;•&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG data-stringify-type="bold"&gt;&lt;CODE class="c-mrkdwn__code" data-stringify-type="code"&gt;spark.sql.files.openCostInBytes&lt;/CODE&gt;&lt;/STRONG&gt;&lt;SPAN&gt;: An internal configuration giving the estimated cost of opening a file, measured as the number of bytes that could be scanned in the same time. The default is 4 MB, and it is added to each file's size as overhead in the partition size calculation.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;When Spark plans the read, it adds the&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="c-mrkdwn__code" data-stringify-type="code"&gt;spark.sql.files.openCostInBytes&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;overhead to the size of each file, so the resulting partitions do not have to match the&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="c-mrkdwn__code" data-stringify-type="code"&gt;spark.sql.files.maxPartitionBytes&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;setting exactly. This may be why the observed average partition size is 159 MB instead of the expected 128 MB.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG data-stringify-type="bold"&gt;Sources:&lt;/STRONG&gt;&lt;BR /&gt;&lt;SPAN&gt;1.&amp;nbsp;&lt;/SPAN&gt;&lt;A class="c-link c-link--underline" href="https://books.japila.pl/spark-sql-internals//docs/configuration-properties.html#spark.sql.files.maxPartitionBytes" target="_blank" rel="noopener noreferrer" data-stringify-link="https://books.japila.pl/spark-sql-internals//docs/configuration-properties.html#spark.sql.files.maxPartitionBytes" data-sk="tooltip_parent"&gt;spark.sql.files.maxPartitionBytes&lt;/A&gt;&lt;BR /&gt;&lt;SPAN&gt;2.&amp;nbsp;&lt;/SPAN&gt;&lt;A class="c-link c-link--underline" href="https://books.japila.pl/spark-sql-internals//docs/configuration-properties.html#spark.sql.files.openCostInBytes" target="_blank" rel="noopener noreferrer" data-stringify-link="https://books.japila.pl/spark-sql-internals//docs/configuration-properties.html#spark.sql.files.openCostInBytes" data-sk="tooltip_parent"&gt;spark.sql.files.openCostInBytes&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 30 Jan 2025 08:29:11 GMT</pubDate>
    <dc:creator>Sidhant07</dc:creator>
    <dc:date>2025-01-30T08:29:11Z</dc:date>
    <item>
      <title>Partition Size:</title>
      <link>https://community.databricks.com/t5/data-engineering/partition-size/m-p/106556#M42518</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am using the default partition size of 128 MB. I am reading a 3.8 GB file and estimating the partition size by dividing the file size by df.rdd.getNumPartitions(), as shown below. The result is 159 MB.&lt;/P&gt;&lt;P&gt;Why does the partition size after reading the file differ from the default?&lt;/P&gt;&lt;P&gt;# Check the default partition size&lt;BR /&gt;partition_size = spark.conf.get("spark.sql.files.maxPartitionBytes").replace("b","")&lt;BR /&gt;print(f"Partition Size: {partition_size} in bytes and {int(partition_size) / 1024 / 1024} in MB")&lt;/P&gt;&lt;P&gt;# Average partition size in MB (file_size is in bytes)&lt;BR /&gt;partition_size = (file_size)/1024/1024/(df.rdd.getNumPartitions())&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jan 2025 23:28:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/partition-size/m-p/106556#M42518</guid>
      <dc:creator>subhas_hati</dc:creator>
      <dc:date>2025-01-21T23:28:04Z</dc:date>
    </item>
    <item>
      <title>Re: Partition Size:</title>
      <link>https://community.databricks.com/t5/data-engineering/partition-size/m-p/107731#M42905</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/145131"&gt;@subhas_hati&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;You read a 3.8 GB file into a DataFrame with the default partition size of 128 MB but measured an average partition size of 159 MB. One setting that influences this is the&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="c-mrkdwn__code" data-stringify-type="code"&gt;spark.sql.files.openCostInBytes&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;configuration. The two relevant settings are:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;•&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG data-stringify-type="bold"&gt;&lt;CODE class="c-mrkdwn__code" data-stringify-type="code"&gt;spark.sql.files.maxPartitionBytes&lt;/CODE&gt;&lt;/STRONG&gt;&lt;SPAN&gt;: The maximum number of bytes to pack into a single partition when reading files. The default is 128 MB.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;•&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG data-stringify-type="bold"&gt;&lt;CODE class="c-mrkdwn__code" data-stringify-type="code"&gt;spark.sql.files.openCostInBytes&lt;/CODE&gt;&lt;/STRONG&gt;&lt;SPAN&gt;: An internal configuration giving the estimated cost of opening a file, measured as the number of bytes that could be scanned in the same time. The default is 4 MB, and it is added to each file's size as overhead in the partition size calculation.&lt;/SPAN&gt;&lt;/P&gt;
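&lt;P&gt;How the two settings above combine can be sketched in plain Python. This is a simplified illustration that assumes the split-size rule used by open-source Spark 3.x file scans; Databricks Runtime may behave differently. The names max_split_bytes and min_partition_num are illustrative, and the value of 8 for min_partition_num is an assumed core count, not something taken from this thread.&lt;/P&gt;
&lt;PRE&gt;
```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, max_partition_bytes=128 * MB,
                    open_cost_in_bytes=4 * MB, min_partition_num=8):
    """Simplified sketch of how a target split size is chosen for a file scan.

    min_partition_num stands in for the cluster's default parallelism
    (assumed to be 8 here purely for illustration).
    """
    # Each file is padded with the open cost before the sizes are summed.
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    # Spread the padded total evenly over the available cores.
    bytes_per_core = total_bytes // min_partition_num
    # Never go below the open cost, never above max_partition_bytes.
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


# One 3.8 GB file, as in the question.
file_size = int(3.8 * 1024 * MB)
split = max_split_bytes([file_size])
# Ceiling division: how many splits of that size cover the file.
num_splits = -(-file_size // split)
print(split // MB, num_splits)
```
&lt;/PRE&gt;
&lt;P&gt;Under these assumptions the sketch yields a split size equal to maxPartitionBytes for a single large file. If the partition count you observe differs, factors outside this sketch are involved, so it is worth checking the number of input files and how file_size was measured.&lt;/P&gt;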
&lt;P&gt;&lt;SPAN&gt;When Spark plans the read, it adds the&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="c-mrkdwn__code" data-stringify-type="code"&gt;spark.sql.files.openCostInBytes&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;overhead to the size of each file, so the resulting partitions do not have to match the&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="c-mrkdwn__code" data-stringify-type="code"&gt;spark.sql.files.maxPartitionBytes&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;setting exactly. This may be why the observed average partition size is 159 MB instead of the expected 128 MB.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG data-stringify-type="bold"&gt;Sources:&lt;/STRONG&gt;&lt;BR /&gt;&lt;SPAN&gt;1.&amp;nbsp;&lt;/SPAN&gt;&lt;A class="c-link c-link--underline" href="https://books.japila.pl/spark-sql-internals//docs/configuration-properties.html#spark.sql.files.maxPartitionBytes" target="_blank" rel="noopener noreferrer" data-stringify-link="https://books.japila.pl/spark-sql-internals//docs/configuration-properties.html#spark.sql.files.maxPartitionBytes" data-sk="tooltip_parent"&gt;spark.sql.files.maxPartitionBytes&lt;/A&gt;&lt;BR /&gt;&lt;SPAN&gt;2.&amp;nbsp;&lt;/SPAN&gt;&lt;A class="c-link c-link--underline" href="https://books.japila.pl/spark-sql-internals//docs/configuration-properties.html#spark.sql.files.openCostInBytes" target="_blank" rel="noopener noreferrer" data-stringify-link="https://books.japila.pl/spark-sql-internals//docs/configuration-properties.html#spark.sql.files.openCostInBytes" data-sk="tooltip_parent"&gt;spark.sql.files.openCostInBytes&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Jan 2025 08:29:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/partition-size/m-p/107731#M42905</guid>
      <dc:creator>Sidhant07</dc:creator>
      <dc:date>2025-01-30T08:29:11Z</dc:date>
    </item>
  </channel>
</rss>

