Change Data Feed Databricks

Mihai_Cog
Contributor

Hello,

I am testing the Change Data Feed feature on Databricks with PySpark (Delta format, of course), and there is something I don't understand:

  1. I created a table
  2. Saved some data inside
  3. Enabled Change Data Feed feature
  4. Applied a merge with a DataFrame that involves inserts, deletes, and updates:
    spark.sql('MERGE INTO test t USING src s ON s.Id = t.Id and s.date_field = t.date_field WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * WHEN NOT MATCHED BY SOURCE THEN DELETE')
  5. Inspected the data:

SELECT * FROM table_changes('table', 2); -- the merge commit is version 2

I observed a row with an ID (for example, 234123) appearing twice. In the _change_type column, the first occurrence shows update_preimage and the second shows update_postimage.

I would say it is normal for the row to appear twice, since something may have changed for that ID. But when I compare every value across both occurrences, I see NO CHANGE.

Is this normal?


6 REPLIES

Kaniz
Community Manager

Hi @Mihai_Cog, yes, this is normal behaviour. When you enable the Change Data Feed feature and apply a merge operation, the system records change events for all data written into the table. This includes the row data and metadata indicating whether each row was inserted, deleted, or updated.

In your case, the _change_type column shows update_preimage and update_postimage for the same ID, which means the row with that ID was involved in an update operation: update_preimage represents the state of the row before the update, and update_postimage represents its state after. Even if no values actually changed during the update, the Change Data Feed still records it as an update event. Hence the row appears twice, once with update_preimage and once with update_postimage, in the _change_type column.

Sources:

- [Delta Change Data Feed](https://docs.databricks.com/delta/delta-change-data-feed.html)
- [Delta Merge](https://docs.databricks.com/delta/merge.html)
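As a minimal sketch of this behaviour (plain Python, not Spark; the function name and row dicts are illustrative, not Delta APIs), every matched row in an UPDATE SET * emits a pre-image/post-image pair regardless of whether values differ:

```python
# Simplified model of the Change Data Feed for MERGE ... WHEN MATCHED THEN
# UPDATE SET *: every matched row emits an update_preimage/update_postimage
# pair, even when the source and target values are identical.
def cdf_rows_for_update(target_row, source_row):
    """Return the change-feed rows a matched UPDATE emits."""
    return [
        {**target_row, "_change_type": "update_preimage"},
        {**source_row, "_change_type": "update_postimage"},
    ]

target = {"Id": 234123, "salary": 1000}
source = {"Id": 234123, "salary": 1000}  # identical values, no real change

for row in cdf_rows_for_update(target, source):
    print(row)  # the same ID appears twice, even though nothing changed
```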

Thank you @Kaniz for the answer, but this behaviour is not what CDF is meant for.

I will explain myself:

1. With this behaviour, I cannot identify which rows were really updated.

2. I cannot find out the count of real updates.

3. Storing all this data, and not only the updates/inserts/deletes, will result in a huge table within a few months.

4. Merging the changes into the base table will take much more time than merging only the real changes.

How can we cover all of these inconveniences?

Tharun-Kumar
Honored Contributor II

@Mihai_Cog When you find an entry with the same values in both the pre-image and post-image, it means that record was indeed updated, but no value changed. The record went through the merge statement, which rewrote its column values. Merge does not know whether a value actually changes or whether it is rewriting the same value; all the update clause does is identify the records eligible for update and write whatever values the source table has.

If you only want real updates, I would suggest adding a filter to your merge statement to eliminate records that have the same values in the columns you are interested in.
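A sketch of that idea in plain Python (the helper and column names are hypothetical, not Delta APIs): keep only matched source rows whose values actually differ from the target, so no-op updates never reach the merge and never appear in the change feed:

```python
# Pre-merge filter: drop source rows that match a target row with
# identical values, keeping only rows that represent a real update.
def real_updates(source_rows, target_rows, key="Id"):
    target_by_key = {r[key]: r for r in target_rows}
    changed = []
    for s in source_rows:
        t = target_by_key.get(s[key])
        if t is not None and s != t:  # matched, and at least one value differs
            changed.append(s)
    return changed

target = [{"Id": 1, "salary": 100}, {"Id": 2, "salary": 200}]
source = [{"Id": 1, "salary": 100},  # unchanged -> filtered out
          {"Id": 2, "salary": 250}]  # real update -> kept
print(real_updates(source, target))  # [{'Id': 2, 'salary': 250}]
```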

@Tharun-Kumar thank you for your answer.

To understand better: are you saying to add a filter, meaning something like this?

.whenMatchedUpdate(set = {
    "id": "updates.id",
    "firstName": "updates.firstName",
    "middleName": "updates.middleName",
    "lastName": "updates.lastName",
    "gender": "updates.gender",
    "birthDate": "updates.birthDate",
    "ssn": "updates.ssn",
    "salary": "updates.salary"
})

Should I list all the columns I am interested in, so that I get exactly the real changes and can do my merge at the upper level with only those real changes?

 

Tharun-Kumar
Honored Contributor II

@Mihai_Cog 

You have to split your merge statement into two parts: one for updates, one for inserts/deletes.

MERGE INTO test t USING src s ON s.Id = t.Id and s.date_field = t.date_field and s.fields <> t.fields WHEN MATCHED THEN UPDATE SET * 

MERGE INTO test t USING src s ON s.Id = t.Id and s.date_field = t.date_field WHEN NOT MATCHED THEN INSERT * WHEN NOT MATCHED BY SOURCE THEN DELETE

In the first statement, you include in the match condition all the columns where you don't want an update to happen when the value is the same, and then perform the update. The second statement handles the inserts and deletes.
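To illustrate the effect of the split (plain Python, illustrative only, not how Delta executes MERGE): unchanged matched rows produce no change event, while real updates, inserts, and deletes each do:

```python
# Simplified model of the two-statement merge above: pass 1 updates only
# matched rows whose values differ (the s.fields <> t.fields condition);
# pass 2 inserts unmatched source rows and deletes target rows absent
# from the source. Returns the change events produced.
def split_merge(target, source, key="Id"):
    src_by_key = {r[key]: r for r in source}
    changes = []
    # Pass 1: UPDATE only matched rows with at least one differing value.
    for i, t in enumerate(target):
        s = src_by_key.get(t[key])
        if s is not None and s != t:
            changes.append(("update", s))
            target[i] = dict(s)
    # Pass 2: INSERT unmatched source rows ...
    tgt_keys = {r[key] for r in target}
    for s in source:
        if s[key] not in tgt_keys:
            changes.append(("insert", s))
            target.append(dict(s))
    # ... and DELETE target rows not matched by the source.
    kept = []
    for t in target:
        if t[key] in src_by_key:
            kept.append(t)
        else:
            changes.append(("delete", t))
    target[:] = kept
    return changes

target = [{"Id": 1, "v": 10}, {"Id": 2, "v": 20}, {"Id": 3, "v": 30}]
source = [{"Id": 1, "v": 10},  # unchanged -> no change event at all
          {"Id": 2, "v": 25},  # real update
          {"Id": 4, "v": 40}]  # new row
print(split_merge(target, source))  # only Id 2, 3, 4 produce events
```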

Thank you for all your help.

If I will have any other questions, I'll be back 😜
