cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

How do I merge two tables and track changes to missing rows as well as new rows

577391
New Contributor II

In my scenario, the new data coming in are the current, valid records. Any records that are not in the new data should be labeled as 'Gone", any matching records should be labeled with "Updated". And finally, any new records should be added.

So in summary, in one merge statement I'd like to:

1. Add new rows from the current data

2. Update existing rows that are in the source table (status="Updated")

3. Update existing rows that are missing from the source table (status="Gone")

Is this possible? This is what I have so far - but it doesn't get me point 3.

Here is a non-working example of what I'm trying to do

CREATE or replace TABLE target (

 id INT NOT NULL,

 status VARCHAR(30) NOT NULL

);

INSERT INTO target (id, status)

VALUES (1, "Initial status"), (2, "Initial status"), (3, "Initial status");

CREATE or replace TABLE source (

 id INT NOT NULL,

 status VARCHAR(30) NOT NULL

);

INSERT INTO source (id, status)

VALUES (1, "Initial status"), (3, "Initial status"), (4, "Initial status") ;

MERGE INTO target t

USING source s

ON t.id = s.id

WHEN MATCHED THEN UPDATE SET t.status = "Updated"

WHEN NOT MATCHED THEN INSERT (id, status) VALUES (s.id, s.status);

select * FROM target order by id;

I've tried putting conditions on 'WHEN NOT MATCHED' and then using 'UPDATE SET', but I get the error message:

Error in SQL statement: ParseException:

mismatched input 'SET' expecting 'INSERT'(line 6, pos 39)

Is this possible?

(I'm coming from SAS and this is super easy to do)

1 ACCEPTED SOLUTION

Accepted Solutions

-werners-
Esteemed Contributor III

Detection deletions does not work out of the box.

The merge statement will evaluate the incoming data against the existing data. It will not check the existing data against the incoming data.

To mark deletions, you will have to specifically update those records.

How do you do this?

This depends, if your source system detects deletions, they can probably be added to the incoming data and you can use the merge statement.

But in many cases deletes are not detected.

In that case you will have to compare the current state of the delta lake table against a complete current version of the data.

Getting the latter can be an issue. What I typically do (in case of an RDBMS source system) is to fetch the complete clustered index of the table (which will go pretty fast, even when there is a lot of data) and use that to compare + left_anti join.

There is also change data feed on delta lake, which you can use, but this only works if your source data is delta lake format.

Does this make any sense?

View solution in original post

2 REPLIES 2

-werners-
Esteemed Contributor III

Detection deletions does not work out of the box.

The merge statement will evaluate the incoming data against the existing data. It will not check the existing data against the incoming data.

To mark deletions, you will have to specifically update those records.

How do you do this?

This depends, if your source system detects deletions, they can probably be added to the incoming data and you can use the merge statement.

But in many cases deletes are not detected.

In that case you will have to compare the current state of the delta lake table against a complete current version of the data.

Getting the latter can be an issue. What I typically do (in case of an RDBMS source system) is to fetch the complete clustered index of the table (which will go pretty fast, even when there is a lot of data) and use that to compare + left_anti join.

There is also change data feed on delta lake, which you can use, but this only works if your source data is delta lake format.

Does this make any sense?

577391
New Contributor II

Yes, this helps! Thank you. Knowing that it doesn't happen out of the box helps me a lot.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group