How to remove duplicates in a Delta table?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
05-19-2023 01:48 AM
I made multiple inserts (by error) in a Delta table and I have now strict duplicates, I feel like it's impossible to delete them if you don't have a column "IDENTITY" to distinguish lines (the primary key is RLOC+LOAD_DATE):
it sounds odd to me not to be able to do any anything after data definition for removing strict identical rows. I wonder if there was still a way to distinguish them by physical address or insertion date?
https://www.sqlshack.com/different-ways-to-sql-delete-duplicate-rows-from-a-sql-table/
Otherwise I will do that:
- Labels:
-
Delta
-
Delta table
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
05-22-2023 02:42 AM
maybe the information I need stands in the value '_metadata.row_index'
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-10-2023 06:34 AM
I am also facing the same issue. The cte query is not working (getting error). I am trying to delete duplicate records from delta table (keeping single record)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
08-08-2023 08:58 AM - edited 08-08-2023 09:00 AM
There are several great ways to handle this: https://stackoverflow.com/questions/61674476/how-to-drop-duplicates-in-delta-table
This was my preference:
with cte as(
Select col1,col2,col3,etc
,row_number()over(partition by col1,col2,col3,etc order by col1)rowno
from table)
Delete from cte where rowno>1
but since you stated you are having issues with cte, you can use the merge option also in the link above.

