Databricks Community

joe_widen · ‎02-03-2025

In DBR 16.1+, we’ve improved functionality of MERGE operations where multiple rows of the source dataset match the same row of the target Delta table, but only one row matches the WHEN MATCHED condition. In the past, these operations would fail with DELTA_MULTIPLE_SOURCE_ROW_MATCHING_TARGET_ROW_IN_MERGE.

Prior DBR’s MERGE would only use the MERGE conditions to detect multiple rows from the source dataset trying to update the target dataset leading to ambiguity. This improved functionality allows for the WHEN MATCHED condition to help resolve the ambiguity.

As an example, let's take a look at a MERGE that will now succeed in DBR 16.1+.

MERGE INTO t USING s
ON t.id = s.id 
WHEN MATCHED AND s.value = 'b' 
THEN UPDATE SET *
WHEN NOT MATCHED 
THEN INSERT *

Target Table

ID	Value
1	a

Source Table

ID	Value
1	b
1	c

In this example, the source has two matching rows that could be used to update the target table. The WHEN MATCHED condition reduces the matching rows from two matches to one match. When running this with an older DBR, it would fail with DELTA_MULTIPLE_SOURCE_ROW_MATCHING_TARGET_ROW_IN_MERGE, even though the WHEN MATCHED condition reduced the matches to one. The improved functionality in DBR 16.1+ will allow this to succeed

BamBam · ‎02-05-2025

What happens in DBRS 16.1 if the WHEN condition is this:

WHEN MATCHED AND s.value IS DISTINCT FROM t.value

Basically saying when it matches on the PK but then check if the other columns are different before taking an action like UPDATE.

Mantsama4 · ‎02-06-2025

This is a great solution! The improved duplicate match handling in Delta MERGE operations in DBR 16.1+ is a significant enhancement, as it resolves ambiguity and ensures smoother operations by leveraging the WHEN MATCHED condition. This update not only simplifies workflows but also reduces errors, making data management more efficient. Well done!

I have a humble question: Are there any recommended cost optimization strategies that can be applied alongside this improved functionality, especially for large-scale MERGE operations? For instance, how can we ensure efficient resource utilization while maintaining performance during such operations? Thank you!

StephanieAlba · ‎04-09-2025

The following example made this more clear for me:

MERGE .... ON source.id = target.id
WHEN MATCHED AND source.value = target.value
WHEN NOT MATCHED THEN INSERT

Screenshot 2025-04-08 at 10.44.42 AM.png

BamBam · ‎04-09-2025

@StephanieAlba So that means sense... so using your data scenario above if we have something like this that uses a IS DISTINCT FROM instead of the "=" (shown below) then we will get an error, correct? It will still try to update two records, which is not allowed. Correct?

MERGE .... ON source.id = target.id
WHEN MATCHED AND source.value IS DISTINCT FROM target.value
WHEN NOT MATCHED THEN INSERT

BamBam · ‎04-09-2025

@StephanieAlba That is unfortunate and will be a breaking change. Is there any setting or SQL clause that would make this MERGE error fail in DBRS 16.0+ given the above scenario?

StephanieAlba · ‎04-09-2025

Oops. I thought I could edit my previous comment. Also, I checked with eng folks and this below is correct.

Screenshot 2025-04-08 at 10.44.42 AM.png

Screenshot 2025-04-08 at 10.44.42 AM copy.png

BamBam · ‎04-09-2025

@StephanieAlba -

Run this SQL:

CREATE TABLE bronze.temp.temp_source (source_id INT, source_first_name STRING, source_last_name STRING);

CREATE TABLE bronze.temp.temp_target (target_id INT, target_first_name STRING, target_last_name STRING);

INSERT INTO bronze.temp.temp_source
VALUES(1,'John','Avery'),(2,'Jack','Rodgers');


-- INSERTED 2 records correctly
MERGE INTO bronze.temp.temp_target AS DEST 
USING (
	SELECT 
		source_id
		, source_first_name
		, source_last_name
	FROM bronze.temp.temp_source
) AS SRC
ON DEST.target_id = SRC.source_id
WHEN MATCHED AND (
	DEST.target_first_name IS DISTINCT FROM SRC.source_first_name
	OR DEST.target_last_name IS DISTINCT FROM SRC.source_last_name
)
THEN UPDATE SET 
	DEST.target_first_name = SRC.source_first_name
	, DEST.target_last_name = SRC.source_last_name
WHEN NOT MATCHED THEN
INSERT (
	DEST.target_id
	, DEST.target_first_name
	, DEST.target_last_name
)
VALUES
(
	SRC.source_id
	, SRC.source_first_name
	, SRC.source_last_name
)
;

UPDATE bronze.temp.temp_source
SET source_first_name = 'Randy'
WHERE source_id = 1;

-- Rerun MERGE.. 1 row updated as expected

INSERT INTO bronze.temp.temp_source
VALUES(1,'Greg','Nelson');

-- Rerun MERGE.. 1 row updated... 😞  It should have errored.
-- Rerunning the merge over and over again just keeps changing the record.


INSERT INTO bronze.temp.temp_source
VALUES(1,'Mary','Wilson');

-- Rerun MERGE... now there are 3 records with an ID = 1.  The MERGE above fails.

If there are two rows with an ID = 1, then the MERGE just keeps updating every time you run. If there are three rows with an ID = 1, then the MERGE will fail. Kinda crazy.

We really need an option, setting or clause in this MERGE statement to allow it to behave like DBRS <= 16+ that would fail if it finds more than one record. Thoughts?

StephanieAlba · ‎04-09-2025

Tested it! Here you go

# Sample data
data = [("1", "A", 30), ("2", "A", 25), ("3", "B", 35)]
columns = ["ID", "Letter", "Value"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

df.write.mode("overwrite").saveAsTable("test_data")

Look at the data quick

select * from test_data

Try merging bad data

MERGE INTO test_data t 
USING (SELECT * FROM VALUES 
            ("1", "A", 31), 
            ("1", "A", 25), 
            ("3", "B", 36)) s
ON t.ID = s.col1
WHEN MATCHED AND t.Letter = s.col2 THEN 
    UPDATE SET t.ID = s.col1, t.Letter = s.col2, t.Value = s.col3
WHEN NOT MATCHED THEN INSERT (`ID`, `Letter`, `Value`) VALUES (s.col1, s.col2, s.col3);

Throws and error [DELTA_MULTIPLE_SOURCE_ROW_MATCHING_TARGET_ROW_IN_MERGE] Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table in possibly conflicting ways. By SQL semantics of Merge, when multiple source rows match on the same target row, the result may be ambiguous

Try merging good data

MERGE INTO test_data t 
USING (SELECT * FROM VALUES 
            ("1", "A", 31), 
            ("1", "D", 25), 
            ("3", "B", 36)) s
ON t.ID = s.col1
WHEN MATCHED AND t.Letter = s.col2 THEN 
    UPDATE SET t.ID = s.col1, t.Letter = s.col2, t.Value = s.col3
WHEN NOT MATCHED THEN INSERT (`ID`, `Letter`, `Value`) VALUES (s.col1, s.col2, s.col3);

Check the updates!

select * from test_data

StephanieAlba · ‎04-09-2025

Hey @BamBam,

I see your code now. I am testing it.

INSERT INTO bronze.temp.temp_source
VALUES(1,'Greg','Nelson');

-- Rerun MERGE.. 1 row updated... 😞  It should have errored.

I am confused. The names are different, so it should be updated.

BamBam · ‎04-09-2025

@StephanieAlba - Thanks for trying that code. Sorry in the DBRS 15.4 this would have failed because it has two ID = 1 records. In DBRS 16.x it doesn't fail. Yes, see if you can get the DBRS 16.x to fail when there are three ID = 1 records, but it doesn't fail if there are two ID = 1 records. That was the unusual part.

Databricks Community

Improved Duplicate Match Handling in Delta MERGE Operations

Metadata-Driven ETL Framework in Databricks (Part-1)

Top 10 query performance tuning tips for Databricks Serverless SQL

Best practices for safe data experimentation with Databricks