How does partition pruning work on a merge into statement?

ghofigjong
New Contributor

I have a Delta table that is partitioned by year, month, and day. I'm trying to merge data into it on all three partition columns plus an extra column (a client ID). My merge statement is below:

MERGE INTO delta.<path of delta table> oldData
USING df newData
ON oldData.year = '2023' AND oldData.month = '10' AND oldData.day = '12'
   AND oldData.clientid = newData.clientid
WHEN MATCHED THEN DELETE

When I do an explain on this query I get this plan:

== Physical Plan ==
Execute MergeIntoCommandEdge
   +- MergeIntoCommandEdge SubqueryAlias newData, SubqueryAlias oldData, Delta[version=4, ... , (((year#3761220 = 2023) AND (month#3761221 = 10)) AND ((day#3761222 = 12) AND (clientid#3761212 = clientid#3532751)))

This query runs for a long time considering that the data I'm trying to process is small (less than 1M rows). Also, based on this link, I should see some mention of the partition filters in the physical plan, but I don't. Any ideas why my merge statement does not seem to use the partitions when it executes?
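One way to check what a merge actually touched (a sketch, assuming a Databricks / Delta Lake environment that records MERGE operation metrics; the path is the same placeholder as above) is to look at the operationMetrics column in the table history after the merge runs:

-- Placeholder path: substitute the real location of the Delta table
DESCRIBE HISTORY delta.`<path of delta table>`;

-- For the MERGE entry, operationMetrics typically includes counters such as
-- numTargetFilesRemoved and numTargetRowsCopied; if these are large relative to
-- a single day's partition, the merge is likely scanning and rewriting far more
-- of the table than the one partition you expect.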

1 ACCEPTED SOLUTION

Anonymous
Not applicable

@Joceph Moreno: Based on the information you have provided, it seems like you are specifying the partition keys in the ON condition of the merge statement. However, you should use the partition columns in the USING clause of the merge statement instead.

Can you check if the below works for you?

MERGE INTO delta.<path of delta table> oldData 
USING (
  SELECT * FROM df WHERE year = '2023' AND month = '10' AND day = '12'
) newData 
ON oldData.clientid = newData.clientid 
WHEN MATCHED THEN DELETE;

In this example, the partition columns (year, month, day) are used in the subquery in the USING clause to filter the data being merged. The partition columns are not included in the ON condition, as they are already being used to filter the data. Instead, the clientid column is used in the ON condition to match records between the old and new data. With this approach, the merge operation should only apply to the partition containing the data that matches the filter condition in the USING clause, which should improve performance.

Please let us know if this helps.


2 REPLIES

Umesh_S
New Contributor II

Isn't the suggested idea only filtering the input dataframe (resulting in a smaller amount of data to match across the whole Delta table), rather than pruning the Delta table down to the relevant partitions to scan?
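A pattern that is sometimes suggested for this situation (a sketch, not something confirmed in this thread) is to combine the two approaches: filter the source in the USING subquery and also keep literal predicates on oldData's partition columns in the ON condition, so the scan of the target table has static predicates it can prune partitions with:

MERGE INTO delta.<path of delta table> oldData
USING (
  -- limit the source to the one day being processed
  SELECT * FROM df WHERE year = '2023' AND month = '10' AND day = '12'
) newData
ON oldData.year = '2023' AND oldData.month = '10' AND oldData.day = '12'
   AND oldData.clientid = newData.clientid
WHEN MATCHED THEN DELETE;

Whether the optimizer actually pushes those literals down as partition filters can depend on the Delta Lake / Databricks Runtime version, so it is worth confirming with EXPLAIN or the table history.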
