<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: data file size in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/data-file-size/m-p/122217#M46703</link>
    <description>&lt;P&gt;What criteria cause the max and min file sizes to vary from the target file size?&lt;/P&gt;</description>
    <pubDate>Thu, 19 Jun 2025 07:33:26 GMT</pubDate>
    <dc:creator>pooja_bhumandla</dc:creator>
    <dc:date>2025-06-19T07:33:26Z</dc:date>
    <item>
      <title>data file size</title>
      <link>https://community.databricks.com/t5/data-engineering/data-file-size/m-p/122136#M46669</link>
      <description>&lt;P&gt;"numRemovedFiles": "2099",&lt;BR /&gt;"numRemovedBytes": "29658974681",&lt;BR /&gt;"p25FileSize": "29701688",&lt;BR /&gt;"numDeletionVectorsRemoved": "0",&lt;BR /&gt;"minFileSize": "19920357",&lt;BR /&gt;"numAddedFiles": "883",&lt;BR /&gt;"maxFileSize": "43475356",&lt;BR /&gt;"p75FileSize": "34394580",&lt;BR /&gt;"p50FileSize": "31978037",&lt;BR /&gt;"numAddedBytes": "28254074450"&lt;/P&gt;&lt;P&gt;targetFileSize: "33554432"&lt;BR /&gt;&lt;BR /&gt;I have the above information after optimizing. Why is the maxFileSize is greater than targetFileSize and similarly y is minFileSize is less than targetFileSize? does targetFileSize has any significance? If it has any significance, then y max and min file sizes are not same as targetFileSize? Based on what criteria the maxFileSize and minFileSize are decided?&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 18 Jun 2025 14:35:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-file-size/m-p/122136#M46669</guid>
      <dc:creator>pooja_bhumandla</dc:creator>
      <dc:date>2025-06-18T14:35:40Z</dc:date>
    </item>
    <item>
      <title>Re: data file size</title>
      <link>https://community.databricks.com/t5/data-engineering/data-file-size/m-p/122143#M46672</link>
      <description>&lt;P&gt;Hello Pooja,&lt;BR /&gt;&lt;BR /&gt;Target File Size (TFS) is a Delta Lake table property (delta.targetFileSize) that lets you specify the desired size of the data files in the root Delta Lake table directory. It ensures Delta Lake tables are written to storage with the specified, &lt;STRONG&gt;approximate file size&lt;/STRONG&gt;. So it is definitely significant, but Delta Lake does not guarantee that every output file after OPTIMIZE will be exactly targetFileSize. Instead, it aims to:&lt;/P&gt;&lt;P&gt;1. Avoid small files&lt;BR /&gt;2. Avoid splitting rows or complex data types mid-record&lt;BR /&gt;3. and so on&lt;/P&gt;&lt;P&gt;That's why you see variation in the min, max, and percentile stats. In your output, the median (p50FileSize, 31978037 bytes) is within about 5% of the 33554432-byte target.&lt;BR /&gt;maxFileSize and minFileSize depend on criteria such as the following (the list is not exhaustive):&lt;/P&gt;&lt;P&gt;1. targetFileSize (as a guideline)&lt;BR /&gt;2. Partition size &amp;amp; skew&lt;BR /&gt;3. Row and schema characteristics&lt;BR /&gt;4. ...&lt;/P&gt;&lt;P&gt;Best, Ilir&lt;/P&gt;</description>
      <pubDate>Wed, 18 Jun 2025 15:19:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-file-size/m-p/122143#M46672</guid>
      <dc:creator>ilir_nuredini</dc:creator>
      <dc:date>2025-06-18T15:19:35Z</dc:date>
    </item>
    <item>
      <title>Re: data file size</title>
      <link>https://community.databricks.com/t5/data-engineering/data-file-size/m-p/122217#M46703</link>
      <description>&lt;P&gt;What criteria cause the max and min file sizes to vary from the target file size?&lt;/P&gt;</description>
      <pubDate>Thu, 19 Jun 2025 07:33:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-file-size/m-p/122217#M46703</guid>
      <dc:creator>pooja_bhumandla</dc:creator>
      <dc:date>2025-06-19T07:33:26Z</dc:date>
    </item>
    <item>
      <title>Re: data file size</title>
      <link>https://community.databricks.com/t5/data-engineering/data-file-size/m-p/122232#M46706</link>
      <description>&lt;P&gt;The criteria that can cause the max and min file sizes to vary from the target file size are:&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;1.&amp;nbsp;Partition size &amp;amp; data skew&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;2.&amp;nbsp;Row size and schema complexity&lt;/SPAN&gt;&lt;BR /&gt;3.&amp;nbsp;Cost-based optimization heuristics&lt;BR /&gt;4. ...&lt;/P&gt;</description>
      <pubDate>Thu, 19 Jun 2025 09:50:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-file-size/m-p/122232#M46706</guid>
      <dc:creator>ilir_nuredini</dc:creator>
      <dc:date>2025-06-19T09:50:31Z</dc:date>
    </item>
  </channel>
</rss>

