article Improved Duplicate Match Handling in Delta MERGE Operations in Technical Blog

Improved Duplicate Match Handling in Delta MERGE Operations

joe_widen — Tue, 04 Feb 2025 01:39:33 GMT

Re: Improved Duplicate Match Handling in Delta MERGE Operations

BamBam — Wed, 05 Feb 2025 13:24:14 GMT

What happens in DBRS 16.1 if the WHEN condition is this:

WHEN MATCHED AND s.value IS DISTINCT FROM t.value

Basically saying when it matches on the PK but then check if the other columns are different before taking an action like UPDATE.

Re: Improved Duplicate Match Handling in Delta MERGE Operations

Mantsama4 — Thu, 06 Feb 2025 23:43:33 GMT

This is a great solution! The improved duplicate match handling in Delta MERGE operations in DBR 16.1+ is a significant enhancement, as it resolves ambiguity and ensures smoother operations by leveraging the WHEN MATCHED condition. This update not only simplifies workflows but also reduces errors, making data management more efficient. Well done!

I have a humble question: Are there any recommended cost optimization strategies that can be applied alongside this improved functionality, especially for large-scale MERGE operations? For instance, how can we ensure efficient resource utilization while maintaining performance during such operations? Thank you!

Re: Improved Duplicate Match Handling in Delta MERGE Operations

StephanieAlba — Wed, 09 Apr 2025 16:37:31 GMT

The following example made this more clear for me:

MERGE .... ON source.id = target.id
WHEN MATCHED AND source.value = target.value
WHEN NOT MATCHED THEN INSERT

Re: Improved Duplicate Match Handling in Delta MERGE Operations

BamBam — Wed, 09 Apr 2025 17:03:14 GMT

@StephanieAlba So that means sense... so using your data scenario above if we have something like this that uses a IS DISTINCT FROM instead of the "=" (shown below) then we will get an error, correct? It will still try to update two records, which is not allowed. Correct?

MERGE .... ON source.id = target.id
WHEN MATCHED AND source.value IS DISTINCT FROM target.value
WHEN NOT MATCHED THEN INSERT

Re: Improved Duplicate Match Handling in Delta MERGE Operations

BamBam — Wed, 09 Apr 2025 17:59:28 GMT

@StephanieAlba That is unfortunate and will be a breaking change. Is there any setting or SQL clause that would make this MERGE error fail in DBRS 16.0+ given the above scenario?

Re: Improved Duplicate Match Handling in Delta MERGE Operations

StephanieAlba — Wed, 09 Apr 2025 18:14:53 GMT

Oops. I thought I could edit my previous comment. Also, I checked with eng folks and this below is correct.

Re: Improved Duplicate Match Handling in Delta MERGE Operations

BamBam — Wed, 09 Apr 2025 19:50:24 GMT

@StephanieAlba -

Run this SQL:

CREATE TABLE bronze.temp.temp_source (source_id INT, source_first_name STRING, source_last_name STRING); CREATE TABLE bronze.temp.temp_target (target_id INT, target_first_name STRING, target_last_name STRING); INSERT INTO bronze.temp.temp_source VALUES(1,'John','Avery'),(2,'Jack','Rodgers'); -- INSERTED 2 records correctly MERGE INTO bronze.temp.temp_target AS DEST USING ( SELECT source_id , source_first_name , source_last_name FROM bronze.temp.temp_source ) AS SRC ON DEST.target_id = SRC.source_id WHEN MATCHED AND ( DEST.target_first_name IS DISTINCT FROM SRC.source_first_name OR DEST.target_last_name IS DISTINCT FROM SRC.source_last_name ) THEN UPDATE SET DEST.target_first_name = SRC.source_first_name , DEST.target_last_name = SRC.source_last_name WHEN NOT MATCHED THEN INSERT ( DEST.target_id , DEST.target_first_name , DEST.target_last_name ) VALUES ( SRC.source_id , SRC.source_first_name , SRC.source_last_name ) ; UPDATE bronze.temp.temp_source SET source_first_name = 'Randy' WHERE source_id = 1; -- Rerun MERGE.. 1 row updated as expected INSERT INTO bronze.temp.temp_source VALUES(1,'Greg','Nelson'); -- Rerun MERGE.. 1 row updated... 😞 It should have errored. -- Rerunning the merge over and over again just keeps changing the record. INSERT INTO bronze.temp.temp_source VALUES(1,'Mary','Wilson'); -- Rerun MERGE... now there are 3 records with an ID = 1. The MERGE above fails.

If there are two rows with an ID = 1, then the MERGE just keeps updating every time you run. If there are three rows with an ID = 1, then the MERGE will fail. Kinda crazy.

We really need an option, setting or clause in this MERGE statement to allow it to behave like DBRS <= 16+ that would fail if it finds more than one record. Thoughts?

Re: Improved Duplicate Match Handling in Delta MERGE Operations

StephanieAlba — Wed, 09 Apr 2025 19:53:36 GMT

Tested it! Here you go

# Sample data data = [("1", "A", 30), ("2", "A", 25), ("3", "B", 35)] columns = ["ID", "Letter", "Value"] # Create DataFrame df = spark.createDataFrame(data, columns) df.write.mode("overwrite").saveAsTable("test_data")

Look at the data quickselect * from test_data

Try merging bad data

MERGE INTO test_data t USING (SELECT * FROM VALUES ("1", "A", 31), ("1", "A", 25), ("3", "B", 36)) s ON t.ID = s.col1 WHEN MATCHED AND t.Letter = s.col2 THEN UPDATE SET t.ID = s.col1, t.Letter = s.col2, t.Value = s.col3 WHEN NOT MATCHED THEN INSERT (`ID`, `Letter`, `Value`) VALUES (s.col1, s.col2, s.col3);

Throws and error [DELTA_MULTIPLE_SOURCE_ROW_MATCHING_TARGET_ROW_IN_MERGE] Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table in possibly conflicting ways. By SQL semantics of Merge, when multiple source rows match on the same target row, the result may be ambiguous

Try merging good data

MERGE INTO test_data t USING (SELECT * FROM VALUES ("1", "A", 31), ("1", "D", 25), ("3", "B", 36)) s ON t.ID = s.col1 WHEN MATCHED AND t.Letter = s.col2 THEN UPDATE SET t.ID = s.col1, t.Letter = s.col2, t.Value = s.col3 WHEN NOT MATCHED THEN INSERT (`ID`, `Letter`, `Value`) VALUES (s.col1, s.col2, s.col3);

Check the updates!

select * from test_data

Re: Improved Duplicate Match Handling in Delta MERGE Operations

StephanieAlba — Wed, 09 Apr 2025 20:16:20 GMT

Hey @BamBam,

I see your code now. I am testing it.

INSERT INTO bronze.temp.temp_source
VALUES(1,'Greg','Nelson');

-- Rerun MERGE.. 1 row updated... 😞  It should have errored.

I am confused. The names are different, so it should be updated.

Re: Improved Duplicate Match Handling in Delta MERGE Operations

BamBam — Wed, 09 Apr 2025 20:47:37 GMT

@StephanieAlba - Thanks for trying that code. Sorry in the DBRS 15.4 this would have failed because it has two ID = 1 records. In DBRS 16.x it doesn't fail. Yes, see if you can get the DBRS 16.x to fail when there are three ID = 1 records, but it doesn't fail if there are two ID = 1 records. That was the unusual part.

Re: Improved Duplicate Match Handling in Delta MERGE Operations

StephanieAlba — Thu, 10 Apr 2025 19:02:07 GMT

@BamBam I tried it, but didn't get an error. What am I missing? See below:

Sample example just adding some new values for insert.

MERGE INTO test_data t USING (SELECT * FROM VALUES ("4", "A", 1), ("4", "D", 2), ("4", "B", 3)) s ON t.ID = s.col1 WHEN MATCHED AND t.Letter = s.col2 THEN UPDATE SET t.ID = s.col1, t.Letter = s.col2, t.Value = s.col3 WHEN NOT MATCHED THEN INSERT (`ID`, `Letter`, `Value`) VALUES (s.col1, s.col2, s.col3);

[works] Attempting to merge rows with IDs = 4 and matching only on one

MERGE INTO test_data t USING (SELECT * FROM VALUES ("4", "A", 4), ("4", "E", 5), ("4", "G", 6)) s ON t.ID = s.col1 WHEN MATCHED AND t.Letter = s.col2 THEN UPDATE SET t.ID = s.col1, t.Letter = s.col2, t.Value = s.col3 WHEN NOT MATCHED THEN INSERT (`ID`, `Letter`, `Value`) VALUES (s.col1, s.col2, s.col3);

[works] Adding 3 rows all matched

MERGE INTO test_data t USING (SELECT * FROM VALUES ("4", "A", 7), ("4", "B", 8), ("4", "D", 9)) s ON t.ID = s.col1 WHEN MATCHED AND t.Letter = s.col2 THEN UPDATE SET t.ID = s.col1, t.Letter = s.col2, t.Value = s.col3 WHEN NOT MATCHED THEN INSERT (`ID`, `Letter`, `Value`) VALUES (s.col1, s.col2, s.col3);

[fails] Two conflicting matches

MERGE INTO test_data t USING (SELECT * FROM VALUES ("4", "A", 10), ("4", "A", 11), ("4", "Q", 12)) s ON t.ID = s.col1 WHEN MATCHED AND t.Letter = s.col2 THEN UPDATE SET t.ID = s.col1, t.Letter = s.col2, t.Value = s.col3 WHEN NOT MATCHED THEN INSERT (`ID`, `Letter`, `Value`) VALUES (s.col1, s.col2, s.col3);

Re: Improved Duplicate Match Handling in Delta MERGE Operations

StephanieAlba — Fri, 25 Apr 2025 16:32:39 GMT

@BamBam you can email me use my first name at databricks.com