topic Remove Duplicate rows in tables in Data Engineering

Remove Duplicate rows in tables

dpc — Mon, 23 Sep 2024 10:07:54 GMT

Hello

I've seen posts that show how to remove duplicates, something like this:

MERGE into [deltatable] as target

USING ( select *, ROW_NUMBER() OVER (Partition By [primary keys] Order By [date] desc) as rn from [deltatable] qualify rn> 1 ) as source

ON [merge primary keys and date column between source and target]

WHEN MATCHED THEN DELETE

Problem is this is reliant upon the rows having something to order by as a differentiator.

If I use the same column to order then use ON source.<PK> = target.<PK>, it just removes every row (so same as just a straight delete)

My data is an exact copy so I have 7 duplicated rows in the table

This is pretty easy to delete in SQL Server but I cannot find an easy method in databricks

Do I have to maybe delete and compare versions then insert from a previous version with a distinct?

Or create a temp table, insert with distinct then truncate source and copy back?

Both these methods seem long-winded

Any thoughts?

Re: Remove Duplicate rows in tables

gchandra — Mon, 23 Sep 2024 14:00:35 GMT

Please try this pyspark solution.
df = spark.table("yourtablename")
df = df.dropDuplicates()

or use the Row_number() query and use DELETE statement where rn > 1

Re: Remove Duplicate rows in tables

filipniziol — Mon, 23 Sep 2024 14:43:07 GMT

Hi @dpc ,

if you like using SQL:

1. Test data:

# Sample data data = [("1", "A"), ("1", "A"), ("2", "B"), ("2", "B"), ("3", "C")] # Create DataFrame df = spark.createDataFrame(data, ["id", "value"]) # Write to Delta table df.write.format("delta").mode("overwrite").saveAsTable("duplicates")

2. Removing the duplicates from the table

%sql CREATE OR REPLACE TEMP VIEW dedups AS SELECT DISTINCT * FROM duplicates; -- Overwrite the existing Delta table with distinct rows INSERT OVERWRITE duplicates SELECT * FROM dedups; SELECT * FROM duplicates;

3. No duplicates in duplicates:

Re: Remove Duplicate rows in tables

dpc — Wed, 25 Sep 2024 16:15:57 GMT

Thanks. This looks more straight forward than I thought

I'll give it a go in the morning and see if it works

Re: Remove Duplicate rows in tables

dpc — Wed, 25 Sep 2024 16:17:48 GMT

I originally tried this:

Row_number() query and use DELETE statement where rn > 1

It removes all rows no matter how I piece the code together

Found this method works well in SQL Server but not databricks