cancel
Showing results for 
Search instead for 
Did you mean: 
Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
cancel
Showing results for 
Search instead for 
Did you mean: 
joe_widen
Databricks Employee
Databricks Employee

In DBR 16.1+, we’ve improved functionality of MERGE operations where multiple rows of the source dataset match the same row of the target Delta table, but only one row matches the WHEN MATCHED condition.  In the past, these operations would fail with DELTA_MULTIPLE_SOURCE_ROW_MATCHING_TARGET_ROW_IN_MERGE.

Prior DBR’s MERGE would only use the MERGE conditions to detect multiple rows from the source dataset trying to update the target dataset leading to ambiguity.  This improved functionality allows for the WHEN MATCHED condition to help resolve the ambiguity.

As an example, let's take a look at a MERGE that will now succeed in DBR 16.1+.

MERGE INTO t USING s
ON t.id = s.id 
WHEN MATCHED AND s.value = 'b' 
THEN UPDATE SET *
WHEN NOT MATCHED 
THEN INSERT *

Target Table

ID

Value

1

a

Source Table

ID

Value

1

b

1

c

 

In this example, the source has two matching rows that could be used to update the target table.  The WHEN MATCHED condition reduces the matching rows from two matches to one match.  When running this with an older DBR, it would fail with DELTA_MULTIPLE_SOURCE_ROW_MATCHING_TARGET_ROW_IN_MERGE, even though the WHEN MATCHED condition reduced the matches to one.  The improved functionality in DBR 16.1+ will allow this to succeed

12 Comments
BamBam
New Contributor III

What happens in DBRS 16.1 if the WHEN condition is this:

WHEN MATCHED AND s.value IS DISTINCT FROM t.value

Basically saying when it matches on the PK but then check if the other columns are different before taking an action like UPDATE.

Mantsama4
Valued Contributor

This is a great solution! The improved duplicate match handling in Delta MERGE operations in DBR 16.1+ is a significant enhancement, as it resolves ambiguity and ensures smoother operations by leveraging the WHEN MATCHED condition. This update not only simplifies workflows but also reduces errors, making data management more efficient. Well done!

I have a humble question: Are there any recommended cost optimization strategies that can be applied alongside this improved functionality, especially for large-scale MERGE operations? For instance, how can we ensure efficient resource utilization while maintaining performance during such operations? Thank you!

StephanieAlba
Databricks Employee
Databricks Employee

The following example made this more clear for me:

MERGE .... ON source.id = target.id
WHEN MATCHED AND source.value = target.value
WHEN NOT MATCHED THEN INSERT

Screenshot 2025-04-08 at 10.44.42 AM.png

BamBam
New Contributor III

@StephanieAlba   So that means sense... so using your data scenario above if we have something like this that uses a IS DISTINCT FROM instead of the "=" (shown below) then we will get an error, correct?  It will still try to update two records, which is not allowed.  Correct?

MERGE .... ON source.id = target.id
WHEN MATCHED AND source.value IS DISTINCT FROM target.value
WHEN NOT MATCHED THEN INSERT

BamBam
New Contributor III

@StephanieAlba That is unfortunate and will be a breaking change.  Is there any setting or SQL clause that would make this MERGE error fail in DBRS 16.0+ given the above scenario?

StephanieAlba
Databricks Employee
Databricks Employee

Oops. I thought I could edit my previous comment. Also, I checked with eng folks and this below is correct.

Screenshot 2025-04-08 at 10.44.42 AM.png

Screenshot 2025-04-08 at 10.44.42 AM copy.png

BamBam
New Contributor III

@StephanieAlba -  

Run this SQL:

CREATE TABLE bronze.temp.temp_source (source_id INT, source_first_name STRING, source_last_name STRING);

CREATE TABLE bronze.temp.temp_target (target_id INT, target_first_name STRING, target_last_name STRING);

INSERT INTO bronze.temp.temp_source
VALUES(1,'John','Avery'),(2,'Jack','Rodgers');


-- INSERTED 2 records correctly
MERGE INTO bronze.temp.temp_target AS DEST 
USING (
	SELECT 
		source_id
		, source_first_name
		, source_last_name
	FROM bronze.temp.temp_source
) AS SRC
ON DEST.target_id = SRC.source_id
WHEN MATCHED AND (
	DEST.target_first_name IS DISTINCT FROM SRC.source_first_name
	OR DEST.target_last_name IS DISTINCT FROM SRC.source_last_name
)
THEN UPDATE SET 
	DEST.target_first_name = SRC.source_first_name
	, DEST.target_last_name = SRC.source_last_name
WHEN NOT MATCHED THEN
INSERT (
	DEST.target_id
	, DEST.target_first_name
	, DEST.target_last_name
)
VALUES
(
	SRC.source_id
	, SRC.source_first_name
	, SRC.source_last_name
)
;

UPDATE bronze.temp.temp_source
SET source_first_name = 'Randy'
WHERE source_id = 1;

-- Rerun MERGE.. 1 row updated as expected

INSERT INTO bronze.temp.temp_source
VALUES(1,'Greg','Nelson');

-- Rerun MERGE.. 1 row updated... 😞  It should have errored.
-- Rerunning the merge over and over again just keeps changing the record.


INSERT INTO bronze.temp.temp_source
VALUES(1,'Mary','Wilson');

-- Rerun MERGE... now there are 3 records with an ID = 1.  The MERGE above fails.

If there are two rows with an ID = 1, then the MERGE just keeps updating every time you run.  If there are three rows with an ID = 1, then the MERGE will fail.  Kinda crazy.

We really need an option, setting or clause in this MERGE statement to allow it to behave like DBRS <= 16+ that would fail if it finds more than one record.  Thoughts? 

StephanieAlba
Databricks Employee
Databricks Employee

Tested it! Here you go

# Sample data
data = [("1", "A", 30), ("2", "A", 25), ("3", "B", 35)]
columns = ["ID", "Letter", "Value"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

df.write.mode("overwrite").saveAsTable("test_data")
 Look at the data quick
select * from test_data​

Try merging bad data

MERGE INTO test_data t 
USING (SELECT * FROM VALUES 
            ("1", "A", 31), 
            ("1", "A", 25), 
            ("3", "B", 36)) s
ON t.ID = s.col1
WHEN MATCHED AND t.Letter = s.col2 THEN 
    UPDATE SET t.ID = s.col1, t.Letter = s.col2, t.Value = s.col3
WHEN NOT MATCHED THEN INSERT (`ID`, `Letter`, `Value`) VALUES (s.col1, s.col2, s.col3);

Throws and error [DELTA_MULTIPLE_SOURCE_ROW_MATCHING_TARGET_ROW_IN_MERGE] Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table in possibly conflicting ways. By SQL semantics of Merge, when multiple source rows match on the same target row, the result may be ambiguous

Try merging good data

MERGE INTO test_data t 
USING (SELECT * FROM VALUES 
            ("1", "A", 31), 
            ("1", "D", 25), 
            ("3", "B", 36)) s
ON t.ID = s.col1
WHEN MATCHED AND t.Letter = s.col2 THEN 
    UPDATE SET t.ID = s.col1, t.Letter = s.col2, t.Value = s.col3
WHEN NOT MATCHED THEN INSERT (`ID`, `Letter`, `Value`) VALUES (s.col1, s.col2, s.col3);

Check the updates! 

select * from test_data
StephanieAlba
Databricks Employee
Databricks Employee

Hey @BamBam,

I see your code now. I am testing it.

INSERT INTO bronze.temp.temp_source
VALUES(1,'Greg','Nelson');

-- Rerun MERGE.. 1 row updated... 😞  It should have errored.

I am confused. The names are different, so it should be updated.

BamBam
New Contributor III

@StephanieAlba  - Thanks for trying that code.  Sorry in the DBRS 15.4 this would have failed because it has two ID = 1 records.  In DBRS 16.x it doesn't fail.  Yes, see if you can get the DBRS 16.x to fail when there are three ID = 1 records, but it doesn't fail if there are two ID = 1 records.  That was the unusual part.