<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Small/big file problem, how do you fix it ? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/small-big-file-problem-how-do-you-fix-it/m-p/13976#M8548</link>
    <description>&lt;P&gt;Okay @Jose Gonzalez, I understand. Thank you, man.&lt;/P&gt;</description>
    <pubDate>Mon, 11 Oct 2021 21:14:08 GMT</pubDate>
    <dc:creator>William_Scardua</dc:creator>
    <dc:date>2021-10-11T21:14:08Z</dc:date>
    <item>
      <title>Small/big file problem, how do you fix it ?</title>
      <link>https://community.databricks.com/t5/data-engineering/small-big-file-problem-how-do-you-fix-it/m-p/13972#M8544</link>
      <description>&lt;P&gt;How do you go about fixing the small/big file problem? What do you suggest?&lt;/P&gt;</description>
      <pubDate>Thu, 07 Oct 2021 01:13:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/small-big-file-problem-how-do-you-fix-it/m-p/13972#M8544</guid>
      <dc:creator>William_Scardua</dc:creator>
      <dc:date>2021-10-07T01:13:12Z</dc:date>
    </item>
    <item>
      <title>Re: Small/big file problem, how do you fix it ?</title>
      <link>https://community.databricks.com/t5/data-engineering/small-big-file-problem-how-do-you-fix-it/m-p/13974#M8546</link>
      <description>&lt;P&gt;Hi @William Scardua,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I recommend using Delta to avoid small/big file issues. For example, Auto Optimize is an optional set of features that automatically compacts small files during individual writes to a Delta table. Paying a small cost during writes offers significant benefits for tables that are queried actively. For more details and examples, please check the following &lt;A href="https://docs.databricks.com/delta/optimizations/auto-optimize.html" alt="https://docs.databricks.com/delta/optimizations/auto-optimize.html" target="_blank"&gt;link&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Auto Optimize aims for files of about 128 MB each. If you would like to compact and optimize further, I recommend running the "Optimize" command on your Delta tables. By default, it compacts the data into files of up to 1 GB. For more details on the Optimize feature, please check the following &lt;A href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html" alt="https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html" target="_blank"&gt;link&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;</description>
      <pubDate>Thu, 07 Oct 2021 16:54:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/small-big-file-problem-how-do-you-fix-it/m-p/13974#M8546</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2021-10-07T16:54:26Z</dc:date>
    </item>
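    <!--
    A minimal sketch of the two Delta steps described in the reply above, assuming a
    Databricks notebook where a `spark` session is available. The helper names and the
    table name `events` are hypothetical and only for illustration; the table properties
    and the OPTIMIZE statement are the ones documented by Databricks for Delta tables.

```python
def auto_optimize_sql(table: str) -> str:
    """Build the statement that turns on Auto Optimize for one Delta table."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        "delta.autoOptimize.optimizeWrite = true, "
        "delta.autoOptimize.autoCompact = true)"
    )


def optimize_sql(table: str) -> str:
    """Build the manual compaction statement for one Delta table."""
    return f"OPTIMIZE {table}"


# Hypothetical table name, used only to show the generated statements.
statements = [auto_optimize_sql("events"), optimize_sql("events")]
for statement in statements:
    print(statement)
    # On a cluster you would run it instead of printing it:
    # spark.sql(statement)
```

    Keeping the statements as plain strings makes them easy to review before they are
    run; whether Auto Optimize alone is enough or a scheduled OPTIMIZE is also needed
    depends on how the table is written and queried.
    -->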
    <item>
      <title>Re: Small/big file problem, how do you fix it ?</title>
      <link>https://community.databricks.com/t5/data-engineering/small-big-file-problem-how-do-you-fix-it/m-p/13975#M8547</link>
      <description>&lt;P&gt;What Jose said.&lt;/P&gt;&lt;P&gt;If you cannot use Delta or do not want to:&lt;/P&gt;&lt;P&gt;coalesce and repartition/partitioning are the way to control the file size.&lt;/P&gt;&lt;P&gt;There is no single ideal file size. It all depends on the use case, the available cluster size, the downstream data flow, etc.&lt;/P&gt;&lt;P&gt;What you do want to avoid is a lot of small files (think only a few megabytes or kilobytes each).&lt;/P&gt;&lt;P&gt;But there is nothing wrong with a &lt;B&gt;single&lt;/B&gt; file of 2 MB.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;That being said: Delta Lake makes this exercise way easier.&lt;/P&gt;</description>
      <pubDate>Fri, 08 Oct 2021 07:01:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/small-big-file-problem-how-do-you-fix-it/m-p/13975#M8547</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-08T07:01:20Z</dc:date>
    </item>
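    <!--
    One way to apply the coalesce/repartition advice above is to derive the number of
    output partitions from the data volume and a target file size. This is only a
    sketch: the function name, the 128 MiB target, and the 10 GiB input size are
    assumptions chosen for illustration, and file sizes on disk will differ from the
    estimate because of compression and encoding.

```python
MIB = 1024 * 1024


def partitions_for_target_size(total_bytes: int, target_file_bytes: int) -> int:
    """Return how many output partitions give files of roughly target_file_bytes."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be non-negative, target must be positive")
    # Integer ceiling division; always write at least one file.
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


# 10 GiB of data at a 128 MiB target per file
num_files = partitions_for_target_size(10 * 1024 * MIB, 128 * MIB)
print(num_files)

# With a Spark DataFrame `df` (assumed to exist) the count would be applied as:
# df.repartition(num_files).write.parquet("/hypothetical/output/path")
# or, to reduce the partition count without a full shuffle:
# df.coalesce(num_files).write.parquet("/hypothetical/output/path")
```

    A 2 MiB dataset yields a single partition with this helper, which matches the point
    that one small file is fine; the problem is many of them.
    -->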
    <item>
      <title>Re: Small/big file problem, how do you fix it ?</title>
      <link>https://community.databricks.com/t5/data-engineering/small-big-file-problem-how-do-you-fix-it/m-p/13976#M8548</link>
      <description>&lt;P&gt;Okay @Jose Gonzalez, I understand. Thank you, man.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Oct 2021 21:14:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/small-big-file-problem-how-do-you-fix-it/m-p/13976#M8548</guid>
      <dc:creator>William_Scardua</dc:creator>
      <dc:date>2021-10-11T21:14:08Z</dc:date>
    </item>
    <item>
      <title>Re: Small/big file problem, how do you fix it ?</title>
      <link>https://community.databricks.com/t5/data-engineering/small-big-file-problem-how-do-you-fix-it/m-p/13977#M8549</link>
      <description>&lt;P&gt;Thank you for the feedback, @Werner Stinckens, that's a good point.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Oct 2021 21:16:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/small-big-file-problem-how-do-you-fix-it/m-p/13977#M8549</guid>
      <dc:creator>William_Scardua</dc:creator>
      <dc:date>2021-10-11T21:16:57Z</dc:date>
    </item>
  </channel>
</rss>

