Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Remove Duplicate rows in tables

dpc
New Contributor III

Hello

I've seen posts that show how to remove duplicates, something like this:

MERGE INTO [deltatable] AS target
USING (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY [primary keys] ORDER BY [date] DESC) AS rn
    FROM [deltatable]
    QUALIFY rn > 1
) AS source
ON [merge primary keys and date column between source and target]
WHEN MATCHED THEN DELETE

The problem is that this relies on the rows having something to order by as a differentiator.

If I order by the same column and use ON source.<PK> = target.<PK>, it just removes every row (the same as a straight delete).

My data is an exact copy, so I have 7 duplicated rows in the table.

This is pretty easy to delete in SQL Server, but I cannot find an easy method in Databricks.
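For reference, the SQL Server pattern I mean is roughly this (table and column names are placeholders); it works there even for identical rows because DELETE through the CTE targets the underlying physical rows:

WITH cte AS (
    SELECT *,
           -- No differentiator needed: ORDER BY (SELECT NULL) assigns arbitrary row numbers
           ROW_NUMBER() OVER (PARTITION BY [primary keys] ORDER BY (SELECT NULL)) AS rn
    FROM [mytable]
)
DELETE FROM cte
WHERE rn > 1;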

Do I have to delete and compare versions, then insert from a previous version with a DISTINCT?

Or create a temp table, insert with DISTINCT, then truncate the source and copy back?

Both of these methods seem long-winded.
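For illustration, the first option would be something like this with Delta time travel (a sketch; the table name and version number are placeholders, and it assumes the table history is still available):

DELETE FROM [deltatable];

-- Re-insert distinct rows from the snapshot taken before the delete
INSERT INTO [deltatable]
SELECT DISTINCT * FROM [deltatable] VERSION AS OF [previous version];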

 

Any thoughts?

ACCEPTED SOLUTION

gchandra
Databricks Employee

Please try this PySpark solution:

df = spark.table("yourtablename")
df = df.dropDuplicates()
# Write back to persist the change; dropDuplicates() alone only returns a new DataFrame (overwriting a table you read from is fine for Delta)
df.write.mode("overwrite").saveAsTable("yourtablename")

Or use the ROW_NUMBER() query and a DELETE statement where rn > 1.





REPLIES


dpc
New Contributor III

I originally tried this:

ROW_NUMBER() query with a DELETE statement where rn > 1

It removes all rows no matter how I piece the code together.

I found this method works well in SQL Server, but not in Databricks.

filipniziol
Contributor

Hi @dpc,

If you like using SQL:

1. Test data:

# Sample data
data = [("1", "A"), ("1", "A"), ("2", "B"), ("2", "B"), ("3", "C")]

# Create DataFrame
df = spark.createDataFrame(data, ["id", "value"])

# Write to Delta table
df.write.format("delta").mode("overwrite").saveAsTable("duplicates")

2. Removing the duplicates from the table

%sql
CREATE OR REPLACE TEMP VIEW dedups AS
SELECT DISTINCT * FROM duplicates;

-- Overwrite the existing Delta table with distinct rows
INSERT OVERWRITE duplicates
SELECT * FROM dedups;

SELECT *
FROM duplicates;

3. No duplicates remain in the duplicates table.

dpc
New Contributor III

Thanks. This looks more straightforward than I thought.

I'll give it a go in the morning and see if it works.
