<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Pre-Partitioning a delta table to reduce shuffling of wide operation in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pre-partitioning-a-delta-table-to-reduce-suffling-of-wide/m-p/82915#M36773</link>
    <description>&lt;P&gt;Hi Maatari!&lt;/P&gt;&lt;P&gt;How are you doing today?&lt;/P&gt;&lt;P&gt;When you group data by a column in a Delta table, Spark has to shuffle the data so that all rows with the same value end up together. Partitioning the Delta table by that same column generally does not remove the shuffle from the query plan, but it can shrink it considerably: every file then holds a single value of the column, so the partial aggregation that runs before the shuffle reduces each task's output to very few rows.&lt;/P&gt;&lt;P&gt;For example, if your Delta table is partitioned by store_id and you group by store_id to get total sales per store, each task can pre-aggregate the files it reads down to one row per store it saw, so far less data moves across the network.&lt;/P&gt;&lt;P&gt;On your second question: when you load a Delta table into a DataFrame, Spark builds the DataFrame partitions from the underlying files, splitting or packing them by size (see spark.sql.files.maxPartitionBytes), not from the table's partition columns. So a table partitioned by store_id does not give you one DataFrame partition per store, and Spark does not treat the DataFrame as already partitioned by store_id. What helps operations like &lt;STRONG&gt;groupBy&lt;/STRONG&gt; is the file layout, where each file contains a single store_id.&lt;/P&gt;&lt;P&gt;In short, partitioning your Delta table by the column you plan to group by can reduce the amount of data that gets shuffled, even though the shuffle itself still happens.&lt;/P&gt;&lt;P&gt;Have a good day.&lt;/P&gt;</description>
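    <content:encoded><![CDATA[<P>To make the effect on shuffle volume concrete, here is a toy model in plain Python. It is not Spark code and does not use any Spark API; it only imitates the "partial aggregate per input split, then merge" pattern that a grouped sum follows. The function names, the four stores, and the row counts are all made up for illustration.</P>

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Map-side step: sum sales per store_id within one input split."""
    totals = defaultdict(int)
    for store_id, sales in rows:
        totals[store_id] += sales
    return dict(totals)


def shuffled_record_count(splits):
    """Count the (store_id, partial_sum) records that would cross the shuffle."""
    return sum(len(partial_aggregate(split)) for split in splits)


def final_aggregate(splits):
    """Reduce-side step: merge the partial sums coming from every split."""
    totals = defaultdict(int)
    for split in splits:
        for store_id, partial in partial_aggregate(split).items():
            totals[store_id] += partial
    return dict(totals)


# Hypothetical data: 4 stores, 100 rows each, every row has sales = 1.
rows = [(store_id, 1) for store_id in range(4) for _ in range(100)]

# Layout A: one split per store, as when files are laid out by store_id.
by_store = [[r for r in rows if r[0] == s] for s in range(4)]

# Layout B: the same rows, but every split contains all four stores.
mixed = [rows[i::4] for i in range(4)]

print(shuffled_record_count(by_store))  # 4
print(shuffled_record_count(mixed))     # 16
print(final_aggregate(by_store) == final_aggregate(mixed))  # True
```

<P>Both layouts give the same totals, and both still need the merge step. The only difference is how many partial records have to be exchanged, which is the part that partitioning by the grouping column can improve. On a real cluster, checking the physical plan of your query is the reliable way to see what Spark actually does.</P>]]></content:encoded>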
    <pubDate>Wed, 14 Aug 2024 03:41:30 GMT</pubDate>
    <dc:creator>Brahmareddy</dc:creator>
    <dc:date>2024-08-14T03:41:30Z</dc:date>
    <item>
      <title>Pre-Partitioning a delta table to reduce shuffling of wide operation</title>
      <link>https://community.databricks.com/t5/data-engineering/pre-partitioning-a-delta-table-to-reduce-suffling-of-wide/m-p/82874#M36758</link>
      <description>&lt;P&gt;Assume I need to perform a groupBy, i.e. an aggregation, on a dataset stored in a Delta table. If the Delta table is partitioned by the field I group by, does that affect the shuffling that the groupBy would normally cause?&lt;/P&gt;&lt;P&gt;As a related question: is there any correlation between how a Delta table is partitioned and how the data is distributed across DataFrame partitions when it is loaded?&lt;/P&gt;</description>
      <pubDate>Tue, 13 Aug 2024 13:02:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pre-partitioning-a-delta-table-to-reduce-suffling-of-wide/m-p/82874#M36758</guid>
      <dc:creator>Maatari</dc:creator>
      <dc:date>2024-08-13T13:02:25Z</dc:date>
    </item>
    <item>
      <title>Re: Pre-Partitioning a delta table to reduce shuffling of wide operation</title>
      <link>https://community.databricks.com/t5/data-engineering/pre-partitioning-a-delta-table-to-reduce-suffling-of-wide/m-p/82932#M36783</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/102834"&gt;@Maatari&lt;/a&gt;, Thanks for reaching out! Please review the responses and let us know which best addresses your question. Your feedback is valuable to us and the community.&lt;/P&gt;
&lt;P&gt;If the response resolves your issue, kindly mark it as the accepted solution. This will help close the thread and assist others with similar queries.&lt;/P&gt;
&lt;P&gt;We appreciate your participation and are here if you need further assistance!&lt;/P&gt;</description>
      <pubDate>Wed, 14 Aug 2024 08:07:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pre-partitioning-a-delta-table-to-reduce-suffling-of-wide/m-p/82932#M36783</guid>
      <dc:creator>Retired_mod</dc:creator>
      <dc:date>2024-08-14T08:07:13Z</dc:date>
    </item>
  </channel>
</rss>

