Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Sort after update on DWH

shusharin_anton
Visitor

Running query on serverless DWH:

UPDATE
catalog.schema.table
SET
col_tmp = CAST(col as DECIMAL(30, 15))

In the query profile, the execution graph shows sort and shuffle stages. The table is partitioned by the partition_date column, and the sort node's details say the sort happens on

catalog.schema.table.partition_date ASC NULLS FIRST

What triggers this? Is it some kind of post-write optimization such as optimizeWrite? (I tried disabling it, but that didn't help.)




1 ACCEPTED SOLUTION

Alberto_Umana
Databricks Employee

Hi @shusharin_anton,

The sort and shuffle stages in your query profile are likely triggered by the need to redistribute and order the data based on the partition_date column. This behavior can be attributed to the way Spark handles data partitioning and sorting during query execution.

 

When you run an UPDATE statement, Spark may need to ensure that the data is correctly partitioned and sorted to apply the updates efficiently. This can involve shuffling data across different nodes to align with the partitioning scheme and then sorting it to maintain the correct order.
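The shuffle-then-sort pattern described above can be illustrated with a toy model in plain Python (no Spark; the row data, task count, and key function are made up for illustration): an "Exchange" step routes rows to tasks by the partition column, then each task sorts its rows on that column ASC NULLS FIRST so it can emit whole partitions one at a time.

```python
# Toy model of the plan's Exchange + Sort stages for a partitioned write.
# Illustrative only: NUM_TASKS, the sample rows, and nulls_first_key are
# assumptions, not Spark internals.
from collections import defaultdict

rows = [("2024-01-02", 1), ("2024-01-01", 2), ("2024-01-02", 3), (None, 4)]
NUM_TASKS = 2

# "Exchange" stage: hash-distribute rows across tasks by partition_date
tasks = defaultdict(list)
for date, payload in rows:
    tasks[hash(date) % NUM_TASKS].append((date, payload))

# "Sort" stage: within each task, order by partition_date ASC NULLS FIRST
def nulls_first_key(row):
    return (row[0] is not None, row[0] or "")

for task_id in tasks:
    tasks[task_id].sort(key=nulls_first_key)
```

With the rows grouped and sorted this way, each task can write its files partition by partition instead of reopening a file per incoming row.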

The sort operation on catalog.schema.table.partition_date ASC NULLS FIRST indicates that Spark is sorting the data based on the partition_date column in ascending order, placing null values first. This sorting is necessary to ensure that the updates are applied in the correct order, especially if the partition_date column is used for partitioning the table.

Disabling optimizeWrite might not affect this behavior because the sort and shuffle operations are fundamental to how Spark processes queries against partitioned tables: they are part of the execution plan that ensures data consistency and efficient updates.
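To verify this yourself, you can toggle the session setting and inspect the plan directly. A sketch (the config key is Delta's optimized-writes session setting; whether EXPLAIN accepts this DML statement on your runtime is an assumption to check):

```sql
-- Session-level switch for Delta optimized writes (the original poster
-- reports the sort remains even with this disabled):
SET spark.databricks.delta.optimizeWrite.enabled = false;

-- Inspect the physical plan of the UPDATE to locate the Sort/Exchange
-- nodes (assumes EXPLAIN supports DML here):
EXPLAIN FORMATTED
UPDATE catalog.schema.table
SET col_tmp = CAST(col AS DECIMAL(30, 15));
```

If the Sort node on partition_date is still present with optimized writes off, that points to the partitioned-write path itself rather than the optimizer setting.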

