<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Confused with Databricks Tips and Tricks - Optimizations regarding partitioning in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/confused-with-databricks-tips-and-tricks-optimizations-regarding/m-p/49424#M1594</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Sorry for asking again, but does the recommendation "&lt;SPAN&gt;don't partition tables &amp;lt;1TB" apply only when the data is written to disk, or also while the job is actually running in memory? (Sorry if the answer is obvious.)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;I ask because in clusters we have GBs of memory to allocate to our datasets, not TBs.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="eimis_pacheco_0-1697575164841.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/4487i2C106DC5C0F5B7C2/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="eimis_pacheco_0-1697575164841.png" alt="eimis_pacheco_0-1697575164841.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Thank you once more.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 17 Oct 2023 20:40:28 GMT</pubDate>
    <dc:creator>eimis_pacheco</dc:creator>
    <dc:date>2023-10-17T20:40:28Z</dc:date>
    <item>
      <title>Confused with Databricks Tips and Tricks - Optimizations regarding partitioning</title>
      <link>https://community.databricks.com/t5/get-started-discussions/confused-with-databricks-tips-and-tricks-optimizations-regarding/m-p/46699#M1349</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Hello Community,&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Today I attended the Tips and Tricks - Optimizations webinar and came away confused. They said:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;Don't partition tables &amp;lt;1TB in size and plan carefully when partitioning&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;• &lt;/SPAN&gt;&lt;SPAN&gt;Partitions should be &amp;gt;=1GB&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My confusion is whether this recommendation is about how the data is stored on disk when it is written at the end of a Spark job, or whether it also applies while a job is running and we are doing transformations and want to split the data across more executors so it runs faster. In other words, if I want to split a table that is close to 1TB across, say, 10 executors to make the job faster, should I not do that?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you in advance for the clarification.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;#dataengineering&lt;/P&gt;</description>
      <pubDate>Fri, 29 Sep 2023 07:15:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/confused-with-databricks-tips-and-tricks-optimizations-regarding/m-p/46699#M1349</guid>
      <dc:creator>eimis_pacheco</dc:creator>
      <dc:date>2023-09-29T07:15:09Z</dc:date>
    </item>
    <item>
      <title>Re: Confused with Databricks Tips and Tricks - Optimizations regarding partitioning</title>
      <link>https://community.databricks.com/t5/get-started-discussions/confused-with-databricks-tips-and-tricks-optimizations-regarding/m-p/49424#M1594</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Sorry for asking again, but does the recommendation "&lt;SPAN&gt;don't partition tables &amp;lt;1TB" apply only when the data is written to disk, or also while the job is actually running in memory? (Sorry if the answer is obvious.)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;I ask because in clusters we have GBs of memory to allocate to our datasets, not TBs.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="eimis_pacheco_0-1697575164841.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/4487i2C106DC5C0F5B7C2/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="eimis_pacheco_0-1697575164841.png" alt="eimis_pacheco_0-1697575164841.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Thank you once more.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 17 Oct 2023 20:40:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/confused-with-databricks-tips-and-tricks-optimizations-regarding/m-p/49424#M1594</guid>
      <dc:creator>eimis_pacheco</dc:creator>
      <dc:date>2023-10-17T20:40:28Z</dc:date>
    </item>
    <item>
      <title>Re: Confused with Databricks Tips and Tricks - Optimizations regarding partitioning</title>
      <link>https://community.databricks.com/t5/get-started-discussions/confused-with-databricks-tips-and-tricks-optimizations-regarding/m-p/49443#M1599</link>
      <description>&lt;P&gt;That recommendation is about partitions on disk.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Defining the correct number of partitions is not that easy.&amp;nbsp; One would think that more partitions are better because you can process more data in parallel,&lt;BR /&gt;and that is true if you only have to do local transformations (no shuffle needed).&lt;BR /&gt;But that is almost never the case.&lt;BR /&gt;When a shuffle is applied, having more partitions means more overhead.&amp;nbsp; That is why the recommendation is not to partition tables below 1TB (it is only a recommendation, though; you might have cases where partitioning makes sense with smaller data).&lt;BR /&gt;&lt;BR /&gt;More recently there is liquid clustering, which makes things easier (no partitioning necessary), but it only works with Delta Lake and recent Databricks Runtime versions.&lt;/P&gt;</description>
      <pubDate>Wed, 18 Oct 2023 06:43:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/confused-with-databricks-tips-and-tricks-optimizations-regarding/m-p/49443#M1599</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-10-18T06:43:27Z</dc:date>
    </item>
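The reply above separates on-disk partition layout (what `df.write.partitionBy(...)` controls) from in-memory parallelism (what `df.repartition(n)` controls), and gives two rules of thumb for the on-disk case: skip partitioning below roughly 1TB, and keep each partition at or above roughly 1GB. A minimal plain-Python sketch of those two rules follows; the function name, return strings, and thresholds are illustrative, not any Databricks API:

```python
def partition_advice(table_size_tb: float, num_partition_values: int) -> str:
    """Apply the two partitioning rules of thumb from the reply above.

    table_size_tb: total table size on disk, in terabytes.
    num_partition_values: distinct values of the candidate partition column
    (i.e. the number of on-disk partitions that column would create).
    """
    if table_size_tb < 1.0:
        # Rule 1: don't partition tables smaller than ~1 TB.
        return "do not partition"
    per_partition_gb = table_size_tb * 1024 / num_partition_values
    if per_partition_gb < 1.0:
        # Rule 2: each partition should hold at least ~1 GB.
        return "partitions too small; pick a coarser column"
    return "partitioning reasonable"
```

By these rules, the table mentioned later in the thread (about 1.7 TB ingested per day, partitioned by date, so one partition value per day) yields partitions far above the 1 GB floor, so the date column is not too fine-grained; whether 1.7 TB per partition is too coarse depends on query patterns, which the rules of thumb do not capture.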
    <item>
      <title>Re: Confused with Databricks Tips and Tricks - Optimizations regarding partitioning</title>
      <link>https://community.databricks.com/t5/get-started-discussions/confused-with-databricks-tips-and-tricks-optimizations-regarding/m-p/74792#M3145</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;Thanks for the tips.&amp;nbsp;&lt;BR /&gt;I have a table into which roughly 1.7 TB of data is ingested daily.&amp;nbsp;&lt;BR /&gt;I have partitioned it on date, and in S3 the Parquet files for a single day total 1.7 TB.&amp;nbsp;&lt;BR /&gt;Should I change the partitioning? Is the partition size too big?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 18 Jun 2024 06:25:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/confused-with-databricks-tips-and-tricks-optimizations-regarding/m-p/74792#M3145</guid>
      <dc:creator>mohitmanna</dc:creator>
      <dc:date>2024-06-18T06:25:20Z</dc:date>
    </item>
  </channel>
</rss>

