Hello
I've seen posts that show how to remove duplicates, something like this:
MERGE into [deltatable] as target
USING ( select *, ROW_NUMBER() OVER (Partition By [primary keys] Order By [date] desc) as rn from [deltatable] qualify rn> 1 ) as source
ON [merge primary keys and date column between source and target]
WHEN MATCHED THEN DELETE
Problem is this is reliant upon the rows having something to order by as a differentiator.
If I use the same column to order then use ON source.<PK> = target.<PK>, it just removes every row (so same as just a straight delete)
My data is an exact copy so I have 7 duplicated rows in the table
This is pretty easy to delete in SQL Server but I cannot find an easy method in databricks
Do I have to maybe delete and compare versions then insert from a previous version with a distinct?
Or create a temp table, insert with distinct then truncate source and copy back?
Both these methods seem long-winded
Any thoughts?