08-11-2023 02:40 AM - edited 08-11-2023 02:44 AM
Hello,
I am doing some testing with the Change Data Feed feature using Databricks and PySpark (on Delta tables, of course), and there is something I don't understand:
SELECT * FROM table_changes('table', 2); -- starting version is 2
I observed a row with a given ID (for example 234123) appearing twice. In the _change_type column, the first occurrence has update_preimage and the second occurrence has update_postimage.
I would say it is normal for the row to appear twice because something may have changed for that ID, but when I check every value in both occurrences I see NO CHANGE.
Is this normal?
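For reference, the equivalent read in PySpark would be roughly this (a sketch only; the table name and starting version mirror the query above, and CDF is assumed to be enabled on the table):
# Read the change feed for the same table, starting at version 2
changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 2)
    .table("table"))
# Look only at the update rows, which come in pre/post image pairs
changes.filter("_change_type IN ('update_preimage', 'update_postimage')").show()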
08-11-2023 03:54 AM
Thank you @Retired_mod for the answer, but this behaviour is not what CDF is meant for.
Let me explain:
1. With this behaviour I cannot identify which rows were really updated.
2. I cannot get a count of the real updates.
3. Storing all of this data, instead of only the real updates/inserts/deletes, will result in a huge table within a few months.
4. Merging these changes into the base table will take much more time than merging only the real changes.
How can we address all of these inconveniences?
08-11-2023 04:21 AM
@Mihai_Cog When you find an entry with the same values in both the pre-image and the post-image, it means that record was indeed updated, just with no change in values. The record matched the MERGE statement and had its columns rewritten. MERGE does not check whether it is actually changing a value or writing back the same value; all the UPDATE clause does is identify the records eligible for update and write whatever values the source table has.
If you only want real updates, I would suggest adding a filter to your merge statement to eliminate records that have the same values in the columns you are interested in.
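For illustration, something roughly like this (just a sketch; the table name, merge key, and the columns in the condition are placeholders you would adapt to your schema):
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "test")        # placeholder target table
(target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")     # "updates" = your source DataFrame
    # only touch rows where at least one tracked column actually changed
    # (note: <> does not match NULLs, handle those separately if needed)
    .whenMatchedUpdateAll(condition="t.salary <> s.salary OR t.lastName <> s.lastName")
    .whenNotMatchedInsertAll()
    .execute())
That way, only rows with a real change produce update_preimage/update_postimage entries in the change feed.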
08-11-2023 06:19 AM - edited 08-11-2023 06:58 AM
@Tharun-Kumar thank you for your answer.
Just to understand better: you are saying to add a filter, meaning something like this?
.whenMatchedUpdate(set =
{
"id": "updates.id",
"firstName": "updates.firstName",
"middleName": "updates.middleName",
"lastName": "updates.lastName",
"gender": "updates.gender",
"birthDate": "updates.birthDate",
"ssn": "updates.ssn",
"salary": "updates.salary"
}
)
So I should put in all the columns I am interested in, and this way I will get exactly the real changes and can do my merge at the upper level with only the real changes?
08-11-2023 11:54 AM
You have to split your merge statement into 2 parts (Update, and Insert/Delete).
MERGE INTO test t USING src s
ON s.Id = t.Id AND s.date_field = t.date_field AND s.fields <> t.fields
WHEN MATCHED THEN UPDATE SET *

MERGE INTO test t USING src s
ON s.Id = t.Id AND s.date_field = t.date_field
WHEN NOT MATCHED THEN INSERT *
WHEN NOT MATCHED BY SOURCE THEN DELETE
In the first statement, include in the match condition an inequality check (s.column <> t.column) for every column where you don't want an update when the value is unchanged, and then perform the update. The second statement handles the inserts and deletes.
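If you prefer the PySpark merge builder, the same idea looks roughly like this (a sketch only; "test", "src" and "s.fields <> t.fields" are the placeholders from the SQL above):
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "test")   # target table
src = spark.table("src")                     # source table/DataFrame

# Part 1: update only rows whose compared fields actually differ
(target.alias("t")
    .merge(src.alias("s"), "s.Id = t.Id AND s.date_field = t.date_field AND s.fields <> t.fields")
    .whenMatchedUpdateAll()
    .execute())

# Part 2: insert new rows and delete rows that disappeared from the source
# (whenNotMatchedBySourceDelete needs a recent Delta Lake / DBR version)
(target.alias("t")
    .merge(src.alias("s"), "s.Id = t.Id AND s.date_field = t.date_field")
    .whenNotMatchedInsertAll()
    .whenNotMatchedBySourceDelete()
    .execute())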
08-13-2023 11:29 PM
Thank you for all your help.
If I have any other questions, I'll be back 😜