<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Sort after update on DWH in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/sort-after-update-on-dwh/m-p/102487#M41138</link>
    <description>&lt;P&gt;Running query on serverless DWH:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;UPDATE&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;catalog.schema.table&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;SET&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;col_tmp&amp;nbsp;= CAST(col&amp;nbsp;as DECIMAL(30, 15&lt;/STRONG&gt;&lt;SPAN&gt;&lt;STRONG&gt;))&lt;BR /&gt;&lt;/STRONG&gt;&lt;BR /&gt;In query profiling, it has some sort and shuffle stages in graph.&lt;BR /&gt;Table has partition by partition_date column&lt;BR /&gt;Some details in sort node mentions that sort happens on&lt;BR /&gt;&lt;STRONG&gt;catalog.schema.table.partition_date ASC NULLS FIRST&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;What triggers it?&lt;BR /&gt;Some kind of post-optimization like optimizeWrite (tried to disable it - do not work)?&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
    <pubDate>Wed, 18 Dec 2024 13:45:53 GMT</pubDate>
    <dc:creator>shusharin_anton</dc:creator>
    <dc:date>2024-12-18T13:45:53Z</dc:date>
    <item>
      <title>Sort after update on DWH</title>
      <link>https://community.databricks.com/t5/data-engineering/sort-after-update-on-dwh/m-p/102487#M41138</link>
      <description>&lt;P&gt;Running query on serverless DWH:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;UPDATE&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;catalog.schema.table&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;SET&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;col_tmp&amp;nbsp;= CAST(col&amp;nbsp;as DECIMAL(30, 15&lt;/STRONG&gt;&lt;SPAN&gt;&lt;STRONG&gt;))&lt;BR /&gt;&lt;/STRONG&gt;&lt;BR /&gt;In query profiling, it has some sort and shuffle stages in graph.&lt;BR /&gt;Table has partition by partition_date column&lt;BR /&gt;Some details in sort node mentions that sort happens on&lt;BR /&gt;&lt;STRONG&gt;catalog.schema.table.partition_date ASC NULLS FIRST&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;What triggers it?&lt;BR /&gt;Some kind of post-optimization like optimizeWrite (tried to disable it - do not work)?&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 18 Dec 2024 13:45:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/sort-after-update-on-dwh/m-p/102487#M41138</guid>
      <dc:creator>shusharin_anton</dc:creator>
      <dc:date>2024-12-18T13:45:53Z</dc:date>
    </item>
    <item>
      <title>Re: Sort after update on DWH</title>
      <link>https://community.databricks.com/t5/data-engineering/sort-after-update-on-dwh/m-p/102488#M41139</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/136844"&gt;@shusharin_anton&lt;/a&gt;,&lt;/P&gt;
&lt;P class="p1"&gt;The sort and shuffle stages in your query profile are likely triggered by the need to redistribute and order the data based on the partition_date column. This behavior can be attributed to the way Spark handles data partitioning and sorting during query execution.&lt;/P&gt;
&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="p1"&gt;When you run an UPDATE statement, Spark may need to ensure that the data is correctly partitioned and sorted to apply the updates efficiently. This can involve shuffling data across different nodes to align with the partitioning scheme and then sorting it to maintain the correct order.&lt;/P&gt;
&lt;P class="p1"&gt;The sort operation on catalog.schema.table.partition_date ASC NULLS FIRST indicates that Spark is sorting the data based on the partition_date column in ascending order, placing null values first. This sorting is necessary to ensure that the updates are applied in the correct order, especially if the partition_date column is used for partitioning the table.&lt;/P&gt;
&lt;P class="p1"&gt;Disabling optimizeWrite might not affect this behavior because the sort and shuffle operations are fundamental to how Spark processes and optimizes queries involving partitioned tables. These operations are part of the query execution plan to ensure data consistency and efficient updates&lt;/P&gt;</description>
      <pubDate>Wed, 18 Dec 2024 13:51:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/sort-after-update-on-dwh/m-p/102488#M41139</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2024-12-18T13:51:55Z</dc:date>
    </item>
  </channel>
</rss>

