Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Remove Duplicate rows in tables

dpc
New Contributor III

Hello

I've seen posts that show how to remove duplicates, something like this:

MERGE INTO [deltatable] AS target
USING (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY [primary keys] ORDER BY [date] DESC) AS rn
    FROM [deltatable]
    QUALIFY rn > 1
) AS source
ON [merge primary keys and date column between source and target]
WHEN MATCHED THEN DELETE

The problem is that this relies on the rows having something to order by as a differentiator.

If I order by the same column and use ON source.<PK> = target.<PK>, it just removes every row (the same as a straight delete).

My data is an exact copy, so I have 7 duplicated rows in the table.

This is pretty easy to delete in SQL Server, but I cannot find an easy method in Databricks.
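For reference, the SQL Server pattern I mean is roughly this (table and column names are placeholders); it works there even for identical rows because DELETE through the CTE targets the underlying physical rows:

WITH cte AS (
    SELECT *,
           -- No differentiator needed: ORDER BY (SELECT NULL) assigns arbitrary row numbers
           ROW_NUMBER() OVER (PARTITION BY [primary keys] ORDER BY (SELECT NULL)) AS rn
    FROM [mytable]
)
DELETE FROM cte
WHERE rn > 1;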

Do I have to delete and compare versions, then insert from a previous version with a DISTINCT?

Or create a temp table, insert with DISTINCT, then truncate the source and copy back?

Both of these methods seem long-winded.
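For illustration, the first option would be something like this with Delta time travel (a sketch; the table name and version number are placeholders, and it assumes the table history is still available):

DELETE FROM [deltatable];

-- Re-insert distinct rows from the snapshot taken before the delete
INSERT INTO [deltatable]
SELECT DISTINCT * FROM [deltatable] VERSION AS OF [previous version];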

 

Any thoughts?

ACCEPTED SOLUTION

gchandra
Databricks Employee

Please try this PySpark solution:

df = spark.table("yourtablename")
df = df.dropDuplicates()
# Write back to persist the change; dropDuplicates() alone only returns a new DataFrame (overwriting a table you read from is fine for Delta)
df.write.mode("overwrite").saveAsTable("yourtablename")

Or use the ROW_NUMBER() query and a DELETE statement where rn > 1.





REPLIES


dpc
New Contributor III

I originally tried this:

ROW_NUMBER() query with a DELETE statement where rn > 1

It removes all rows no matter how I piece the code together.

I found this method works well in SQL Server, but not in Databricks.

filipniziol
Contributor

Hi @dpc,

If you like using SQL:

1. Test data:

# Sample data
data = [("1", "A"), ("1", "A"), ("2", "B"), ("2", "B"), ("3", "C")]

# Create DataFrame
df = spark.createDataFrame(data, ["id", "value"])

# Write to Delta table
df.write.format("delta").mode("overwrite").saveAsTable("duplicates")

2. Removing the duplicates from the table

%sql
CREATE OR REPLACE TEMP VIEW dedups AS
SELECT DISTINCT * FROM duplicates;

-- Overwrite the existing Delta table with distinct rows
INSERT OVERWRITE duplicates
SELECT * FROM dedups;

SELECT *
FROM duplicates;

3. No duplicates remain in the duplicates table.

dpc
New Contributor III

Thanks. This looks more straightforward than I thought.

I'll give it a go in the morning and see if it works.
