How does partition pruning work on a merge into statement?

ghofigjong
New Contributor

I have a Delta table that is partitioned by year, month, and day. I'm trying to merge data into it on all three partition columns plus an extra column (a client ID). My merge statement is below:

MERGE INTO delta.<path of delta table> oldData
USING df newData
ON oldData.year = '2023' AND oldData.month = '10' AND oldData.day = '12'
   AND oldData.clientid = newData.clientid
WHEN MATCHED THEN DELETE

When I do an explain on this query I get this plan:

== Physical Plan ==
Execute MergeIntoCommandEdge
   +- MergeIntoCommandEdge SubqueryAlias newData, SubqueryAlias oldData, Delta[version=4, ... , (((year#3761220 = 2023) AND (month#3761221 = 10)) AND ((day#3761222 = 12) AND (clientid#3761212 = clientid#3532751)))

This query runs for a long time considering that the data I'm trying to process is small (less than 1M rows). Also, based on this link, I should see some mention of the partition filters in the physical plan, but I don't. Any ideas why my merge statement does not seem to use the partitions when it executes?
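One way to check what a merge actually touched (a sketch, assuming a Databricks / Delta Lake environment that records MERGE operation metrics; the path is the same placeholder as above) is to look at the operationMetrics column in the table history after the merge runs:

-- Placeholder path: substitute the real location of the Delta table
DESCRIBE HISTORY delta.`<path of delta table>`;

-- For the MERGE entry, operationMetrics typically includes counters such as
-- numTargetFilesRemoved and numTargetRowsCopied; if these are large relative to
-- a single day's partition, the merge is likely scanning and rewriting far more
-- of the table than the one partition you expect.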

1 ACCEPTED SOLUTION

Anonymous
Not applicable

@Joceph Moreno: Based on the information you have provided, it seems like you are specifying the partition keys in the ON condition of the merge statement. However, you should use the partition columns in the USING clause of the merge statement instead.

Can you check if the below works for you?

MERGE INTO delta.<path of delta table> oldData 
USING (
  SELECT * FROM df WHERE year = '2023' AND month = '10' AND day = '12'
) newData 
ON oldData.clientid = newData.clientid 
WHEN MATCHED THEN DELETE;

In this example, the partition columns (year, month, day) are used in the subquery in the USING clause to filter the data being merged. The partition columns are not included in the ON condition, as they are already being used to filter the data. Instead, the clientid column is used in the ON condition to match records between the old and new data. With this approach, the merge operation should only apply to the partition containing the data that matches the filter condition in the USING clause, which should improve performance.

Please let us know if this helps.


2 REPLIES

Umesh_S
New Contributor II

Isn't the suggested idea only filtering the input dataframe (resulting in a smaller amount of data to match across the whole Delta table), rather than pruning the Delta table down to the relevant partitions to scan?
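A pattern that is sometimes suggested for this situation (a sketch, not something confirmed in this thread) is to combine the two approaches: filter the source in the USING subquery and also keep literal predicates on oldData's partition columns in the ON condition, so the scan of the target table has static predicates it can prune partitions with:

MERGE INTO delta.<path of delta table> oldData
USING (
  -- limit the source to the one day being processed
  SELECT * FROM df WHERE year = '2023' AND month = '10' AND day = '12'
) newData
ON oldData.year = '2023' AND oldData.month = '10' AND oldData.day = '12'
   AND oldData.clientid = newData.clientid
WHEN MATCHED THEN DELETE;

Whether the optimizer actually pushes those literals down as partition filters can depend on the Delta Lake / Databricks Runtime version, so it is worth confirming with EXPLAIN or the table history.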
