cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

How does partition pruning work on a merge into statement?

ghofigjong
New Contributor

I have a delta table that is partitioned by Year, Date and month. I'm trying to merge data to this on all three partition columns + an extra column (an ID). My merge statement is below:

MERGE INTO delta.<path of delta table> oldData 
using df newData 
on oldData.year = '2023' and oldData.month = '10' and oldData.day = '12' and oldData.clientid=newData.clientid WHEN MATCHED THEN DELETE

When I do an explain on this query I get this plan:

== Physical Plan ==
Execute MergeIntoCommandEdge
   +- MergeIntoCommandEdge SubqueryAlias newData, SubqueryAlias oldData, Delta[version=4, ... , (((year#3761220 = 2023) AND (month#3761221 = 10)) AND ((day#3761222 = 12) AND (clientid#3761212 = clientid#3532751)))

This query runs for a long time considering I think the data I'm trying to process is small (less than 1M). Also based on this link I should see some sort of mention of partition in the physical plan but I don't. Any ideas why it seems that my merge statement is not using the partition when executing the merge statement?

1 ACCEPTED SOLUTION

Accepted Solutions

Anonymous
Not applicable

@Joceph Moreno​ : Based on the information you have provided, it seems like you are specifying the partition keys in the ON condition of the merge statement. However, you should be using the partition columns in the USING clause of the merge statement instead.

Csn you check if the below works for you?

MERGE INTO delta.<path of delta table> oldData 
USING (
  SELECT * FROM df WHERE year = '2023' AND month = '10' AND day = '12'
) newData 
ON oldData.clientid = newData.clientid 
WHEN MATCHED THEN DELETE;

In this example, the partition columns (year, month, day) are used in the subquery in the USING clause to filter the data being merged. The partition columns are not included in the ON condition, as they are already being used to filter the data. Instead, the clientid column is used in the ON condition to match records between the old and new data. With this approach, the merge operation should only apply to the partition containing the data that matches the filter condition in the USING clause, which should improve performance.

Please let us know if this helps.

View solution in original post

2 REPLIES 2

Anonymous
Not applicable

@Joceph Moreno​ : Based on the information you have provided, it seems like you are specifying the partition keys in the ON condition of the merge statement. However, you should be using the partition columns in the USING clause of the merge statement instead.

Csn you check if the below works for you?

MERGE INTO delta.<path of delta table> oldData 
USING (
  SELECT * FROM df WHERE year = '2023' AND month = '10' AND day = '12'
) newData 
ON oldData.clientid = newData.clientid 
WHEN MATCHED THEN DELETE;

In this example, the partition columns (year, month, day) are used in the subquery in the USING clause to filter the data being merged. The partition columns are not included in the ON condition, as they are already being used to filter the data. Instead, the clientid column is used in the ON condition to match records between the old and new data. With this approach, the merge operation should only apply to the partition containing the data that matches the filter condition in the USING clause, which should improve performance.

Please let us know if this helps.

Umesh_S
New Contributor II

Isn't the suggested idea only filtering the input dataframe (resulting in a smaller amount of data to match across the whole delta table) rather than prune the delta table for relevant partitions to scan?

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group