Remove Duplicate rows in tables

dpc
Contributor III

Hello

I've seen posts that show how to remove duplicates, something like this:

MERGE into [deltatable] as target

USING ( select *, ROW_NUMBER() OVER (Partition By [primary keys] Order By [date] desc) as rn  from [deltatable] qualify rn> 1 ) as source

ON [merge primary keys and date column between source and target]

WHEN MATCHED THEN DELETE

Problem is this is reliant upon the rows having something to order by as a differentiator.

If I use the same column to order then use ON source.<PK> = target.<PK>, it just removes every row (so same as just a straight delete)

 My data is an exact copy so I have 7 duplicated rows in the table

This is pretty easy to delete in SQL Server but I cannot find an easy method in databricks

Do I have to maybe delete and compare versions then insert from a previous version with a distinct?

Or create a temp table, insert with distinct then truncate source and copy back?

Both these methods seem long-winded

 

Any thoughts?

gchandra
Databricks Employee
Databricks Employee

Please try this pyspark solution. 
df = spark.table("yourtablename")
df = df.dropDuplicates()

or use the Row_number() query and use DELETE statement where rn > 1



~

View solution in original post

dpc
Contributor III

I originally tried this:

Row_number() query and use DELETE statement where rn > 1

It removes all rows no matter how I piece the code together

Found this method works well in SQL Server but not databricks

filipniziol
Esteemed Contributor

Hi @dpc ,

if you like using SQL:

1. Test data:

# Sample data
data = [("1", "A"), ("1", "A"), ("2", "B"), ("2", "B"), ("3", "C")]

# Create DataFrame
df = spark.createDataFrame(data, ["id", "value"])

# Write to Delta table
df.write.format("delta").mode("overwrite").saveAsTable("duplicates")

2. Removing the duplicates from the table

%sql
CREATE OR REPLACE TEMP VIEW dedups AS
SELECT DISTINCT * FROM duplicates;

-- Overwrite the existing Delta table with distinct rows
INSERT OVERWRITE duplicates
SELECT * FROM dedups;

SELECT *
FROM duplicates;

3. No duplicates in duplicates:

filipniziol_0-1727102570106.png

 

 

Thanks. This looks more straight forward than I thought

I'll give it a go in the morning and see if it works