Databricks Community

joe_widen · ‎02-03-2025

In DBR 16.1+, we’ve improved functionality of MERGE operations where multiple rows of the source dataset match the same row of the target Delta table, but only one row matches the WHEN MATCHED condition. In the past, these operations would fail with DELTA_MULTIPLE_SOURCE_ROW_MATCHING_TARGET_ROW_IN_MERGE.

Prior DBR’s MERGE would only use the MERGE conditions to detect multiple rows from the source dataset trying to update the target dataset leading to ambiguity. This improved functionality allows for the WHEN MATCHED condition to help resolve the ambiguity.

As an example, let's take a look at a MERGE that will now succeed in DBR 16.1+.

MERGE INTO t USING s
ON t.id = s.id 
WHEN MATCHED AND s.value = 'b' 
THEN UPDATE SET *
WHEN NOT MATCHED 
THEN INSERT *

Target Table

ID	Value
1	a

Source Table

ID	Value
1	b
1	c

In this example, the source has two matching rows that could be used to update the target table. The WHEN MATCHED condition reduces the matching rows from two matches to one match. When running this with an older DBR, it would fail with DELTA_MULTIPLE_SOURCE_ROW_MATCHING_TARGET_ROW_IN_MERGE, even though the WHEN MATCHED condition reduced the matches to one. The improved functionality in DBR 16.1+ will allow this to succeed

BamBam · ‎02-05-2025

What happens in DBRS 16.1 if the WHEN condition is this:

WHEN MATCHED AND s.value IS DISTINCT FROM t.value

Basically saying when it matches on the PK but then check if the other columns are different before taking an action like UPDATE.

Mantsama4 · ‎02-06-2025

This is a great solution! The improved duplicate match handling in Delta MERGE operations in DBR 16.1+ is a significant enhancement, as it resolves ambiguity and ensures smoother operations by leveraging the WHEN MATCHED condition. This update not only simplifies workflows but also reduces errors, making data management more efficient. Well done!

I have a humble question: Are there any recommended cost optimization strategies that can be applied alongside this improved functionality, especially for large-scale MERGE operations? For instance, how can we ensure efficient resource utilization while maintaining performance during such operations? Thank you!

Databricks Community

Improved Duplicate Match Handling in Delta MERGE Operations

Metadata-Driven ETL Framework in Databricks (Part-1)

Best practices for safe data experimentation with Databricks

Top 10 query performance tuning tips for Databricks Serverless SQL