<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to control file size by OPTIMIZE in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-control-file-size-by-optimize/m-p/92685#M38507</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/100643"&gt;@MikeGo&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;Could you share the table's statistics: how many files does it have, and what is the average file size?&lt;BR /&gt;1. The purpose of OPTIMIZE is to combine many small files. Does your table contain many small files?&lt;BR /&gt;2. OPTIMIZE is idempotent: if it is run twice without any data changes in the table, the second OPTIMIZE won't do anything.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;In general, based on &lt;A href="https://docs.databricks.com/en/delta/tune-file-size.html" target="_self"&gt;this article&lt;/A&gt;,&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;the&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class=""&gt;delta.targetFileSize&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;setting acts as a guideline or target for the desired file size, but the actual file sizes can vary based on several factors, including the current size of the table, the nature of the data, and the specific operations being performed:&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_0-1727980267268.png" style="width: 778px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11674i6B0F9C6A58D9F73C/image-dimensions/778x72?v=v2" width="778" height="72" role="button" title="filipniziol_0-1727980267268.png" alt="filipniziol_0-1727980267268.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 03 Oct 2024 18:32:01 GMT</pubDate>
    <dc:creator>filipniziol</dc:creator>
    <dc:date>2024-10-03T18:32:01Z</dc:date>
    <item>
      <title>How to control file size by OPTIMIZE</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-control-file-size-by-optimize/m-p/92680#M38503</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have a Delta table under Unity Catalog (UC), with no partitioning and no liquid clustering. I tried&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;OPTIMIZE foo;
-- OR
ALTER TABLE foo SET TBLPROPERTIES(delta.targetFileSize = '128mb');
OPTIMIZE foo;&lt;/LI-CODE&gt;&lt;P&gt;I expected the files to change after running the above, but OPTIMIZE returns 0 filesAdded and 0 filesRemoved, and &lt;SPAN&gt;DESCRIBE&lt;/SPAN&gt; &lt;SPAN&gt;DETAIL&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;foo shows no change in numFiles.&lt;BR /&gt;Am I missing something? How can I get files of the expected size? Are there conditions that must be met before OPTIMIZE will control file size?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Thu, 03 Oct 2024 17:27:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-control-file-size-by-optimize/m-p/92680#M38503</guid>
      <dc:creator>MikeGo</dc:creator>
      <dc:date>2024-10-03T17:27:43Z</dc:date>
    </item>
    <item>
      <title>Re: How to control file size by OPTIMIZE</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-control-file-size-by-optimize/m-p/92685#M38507</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/100643"&gt;@MikeGo&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;Could you share the table's statistics: how many files does it have, and what is the average file size?&lt;BR /&gt;1. The purpose of OPTIMIZE is to combine many small files. Does your table contain many small files?&lt;BR /&gt;2. OPTIMIZE is idempotent: if it is run twice without any data changes in the table, the second OPTIMIZE won't do anything.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;In general, based on &lt;A href="https://docs.databricks.com/en/delta/tune-file-size.html" target="_self"&gt;this article&lt;/A&gt;,&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;the&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class=""&gt;delta.targetFileSize&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;setting acts as a guideline or target for the desired file size, but the actual file sizes can vary based on several factors, including the current size of the table, the nature of the data, and the specific operations being performed:&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_0-1727980267268.png" style="width: 778px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11674i6B0F9C6A58D9F73C/image-dimensions/778x72?v=v2" width="778" height="72" role="button" title="filipniziol_0-1727980267268.png" alt="filipniziol_0-1727980267268.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
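      A rough way to picture the two points in the reply above (small files get combined, and a repeated run finds nothing to do) is a toy bin-packing pass. This is only a sketch under stated assumptions: the DataFile class, the compact function, and the greedy packing rule are hypothetical illustrations, not the selection logic Databricks actually uses, and the file sizes are made up.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    name: str
    size: int  # size in bytes


def compact(files, target_size):
    """One toy compaction pass; returns (files, files_added, files_removed)."""
    keep, small = [], []
    for f in files:
        # Files already at or above the target are never rewritten.
        (keep if f.size >= target_size else small).append(f)

    # Greedily pack the small files, largest first, into bins of at most target_size.
    bins, current, current_size = [], [], 0
    for f in sorted(small, key=lambda f: f.size, reverse=True):
        if current and current_size + f.size > target_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size
    if current:
        bins.append(current)

    result, added, removed = list(keep), 0, 0
    for i, group in enumerate(bins):
        if len(group) == 1:
            # A bin holding a single file has nothing to merge with, so keep it.
            result.append(group[0])
            continue
        result.append(DataFile(f"compacted-{i}", sum(f.size for f in group)))
        added += 1
        removed += len(group)
    return result, added, removed


# Hypothetical table: one large file and three small ones, target of 128 MB.
files = [
    DataFile("a", 200 * MB),
    DataFile("b", 50 * MB),
    DataFile("c", 40 * MB),
    DataFile("d", 30 * MB),
]
files, added_1, removed_1 = compact(files, 128 * MB)
files, added_2, removed_2 = compact(files, 128 * MB)
print(added_1, removed_1)  # 1 3
print(added_2, removed_2)  # 0 0
```

      In this example the first pass merges the three small files into one 120 MB file, and the second pass over the unchanged result adds and removes nothing, which is the same shape as an OPTIMIZE result of 0 filesAdded and 0 filesRemoved.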
      <pubDate>Thu, 03 Oct 2024 18:32:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-control-file-size-by-optimize/m-p/92685#M38507</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-10-03T18:32:01Z</dc:date>
    </item>
    <item>
      <title>Re: How to control file size by OPTIMIZE</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-control-file-size-by-optimize/m-p/92701#M38512</link>
      <description>&lt;P&gt;I'm doing some testing, so I first created a table foo and then generated some test data src to upsert with MERGE. While I experiment (with MERGE or INSERT), DESC DETAIL foo shows a numFiles of, e.g., 3, but there are more files on S3. I run OPTIMIZE after each MERGE or INSERT and don't see any change in the files. The Parquet files on S3 range from 33KB to 509KB. I expected them to be merged into one file after OPTIMIZE.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Oct 2024 22:49:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-control-file-size-by-optimize/m-p/92701#M38512</guid>
      <dc:creator>MikeGo</dc:creator>
      <dc:date>2024-10-03T22:49:19Z</dc:date>
    </item>
    <item>
      <title>Re: How to control file size by OPTIMIZE</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-control-file-size-by-optimize/m-p/92702#M38513</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/100643"&gt;@MikeGo&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;Databricks is a big data processing engine. Instead of testing with 3 files, try testing with 3000 files &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&amp;nbsp;&lt;BR /&gt;OPTIMIZE may not be merging your small files because there are not enough files or data for it to act on.&lt;/P&gt;&lt;P&gt;Regarding why DESC DETAIL shows 3 files while the folder contains more: numFiles counts only the active files, that is, the files that are part of the latest snapshot of your Delta table.&lt;/P&gt;&lt;P&gt;When you perform write operations such as INSERT, UPDATE, DELETE, or MERGE, the old files are not deleted from storage; instead, they are marked as removed in the Delta transaction log. They are kept to support features like time travel (for example, you can query the table as it was yesterday).&lt;/P&gt;&lt;P&gt;Hope that makes sense.&lt;/P&gt;</description>
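      The difference between numFiles and what sits in the storage folder can be sketched by replaying a simplified transaction log. This is a minimal illustration, assuming each commit is just a list of add and remove actions; the replay function, the file names, and the three commits are hypothetical, and a real Delta log records far more detail per action.

```python
def replay(log):
    """Return the set of active file paths after applying every commit in order."""
    active = set()
    for commit in log:
        for action, path in commit:
            if action == "add":
                active.add(path)
            elif action == "remove":
                # The file is only dropped from the snapshot; it stays in storage.
                active.discard(path)
    return active


# Hypothetical history: an INSERT, a MERGE that rewrites part-0, another INSERT.
log = [
    [("add", "part-0.parquet"), ("add", "part-1.parquet")],
    [("remove", "part-0.parquet"), ("add", "part-2.parquet")],
    [("add", "part-3.parquet")],
]

active = replay(log)
# Every file ever added is still physically present until it is cleaned up.
in_storage = {path for commit in log for action, path in commit if action == "add"}
no_longer_active = in_storage - active

print(len(active))               # 3
print(len(in_storage))           # 4
print(sorted(no_longer_active))  # ['part-0.parquet']
```

      Here the latest snapshot references three files while four exist in storage; the extra one is the older version kept around for time travel.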
      <pubDate>Thu, 03 Oct 2024 23:06:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-control-file-size-by-optimize/m-p/92702#M38513</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-10-03T23:06:13Z</dc:date>
    </item>
  </channel>
</rss>

