cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

How to remove duplicates in a Delta table?

Thor
New Contributor III

I made multiple inserts (by error) in a Delta table and I have now strict duplicates, I feel like it's impossible to delete them if you don't have a column "IDENTITY" to distinguish lines (the primary key is RLOC+LOAD_DATE):

snap deleteit sounds odd to me not to be able to do any anything after data definition for removing strict identical rows. I wonder if there was still a way to distinguish them by physical address or insertion date?

https://www.sqlshack.com/different-ways-to-sql-delete-duplicate-rows-from-a-sql-table/

Otherwise I will do that:

snap identity

3 REPLIES 3

Thor
New Contributor III

maybe the information I need stands in the value '_metadata.row_index'

adeshpande
New Contributor II

I am also facing the same issue. The cte query is not working (getting error). I am trying to delete duplicate records from delta table (keeping single record)

Ken_H
New Contributor II

There are several great ways to handle this:  https://stackoverflow.com/questions/61674476/how-to-drop-duplicates-in-delta-table

This was my preference: 

with cte as(
Select col1,col2,col3,etc
,row_number()over(partition by col1,col2,col3,etc order by col1)rowno
from table)
Delete from cte where rowno>1

but since you stated you are having issues with cte, you can use the merge option also in the link above.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.