09-23-2024 03:07 AM
Hello
I've seen posts that show how to remove duplicates, something like this:
MERGE INTO [deltatable] AS target
USING (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY [primary keys] ORDER BY [date] DESC) AS rn
  FROM [deltatable]
  QUALIFY rn > 1
) AS source
ON [merge primary keys and date column between source and target]
WHEN MATCHED THEN DELETE
The problem is that this relies on the rows having something to order by as a differentiator.
If I order by the same column and then use ON source.<PK> = target.<PK>, it just removes every row (the same as a straight delete).
My rows are exact copies, so I have 7 fully duplicated rows in the table.
This is easy to delete in SQL Server, but I can't find a simple method in Databricks.
Do I have to delete, compare versions, and then insert from a previous version with a DISTINCT?
Or create a temp table, insert with DISTINCT, then truncate the source and copy back?
Both of these methods seem long-winded.
Any thoughts?
09-23-2024 06:59 AM - edited 09-23-2024 07:00 AM
Please try this PySpark solution:
df = spark.table("yourtablename")
df = df.dropDuplicates()
# dropDuplicates only dedups the DataFrame; write it back or the table is unchanged
df.write.format("delta").mode("overwrite").saveAsTable("yourtablename")
Or use the ROW_NUMBER() query and a DELETE statement where rn > 1.
09-25-2024 09:17 AM
I originally tried this:
ROW_NUMBER() query and a DELETE statement where rn > 1
It removes all rows no matter how I piece the code together.
I found this method works well in SQL Server but not in Databricks.
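For what it's worth, the failure mode with exact copies is that the rn > 1 rows are physically indistinguishable from the rows you want to keep: once you join them back on the key, the join matches every copy, so the DELETE removes them all. Rebuilding from one representative per group (what SELECT DISTINCT does) is what works. A pure-Python sketch of the two behaviours, using made-up sample data just to illustrate the set logic:

```python
# Hypothetical duplicated rows: (id, value) pairs, each duplicate an exact copy
rows = [("1", "A"), ("1", "A"), ("2", "B"), ("2", "B"), ("3", "C")]

# "DELETE via join on the key": every row whose key appears among the
# duplicated rows is matched, so BOTH copies of each duplicate are removed
dupe_keys = {r[0] for r in rows if rows.count(r) > 1}
after_bad_delete = [r for r in rows if r[0] not in dupe_keys]
# only ("3", "C") survives -- the same over-deletion seen in the thread

# DISTINCT-style rebuild: keep exactly one copy of each row
deduped = list(dict.fromkeys(rows))
```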
09-23-2024 07:43 AM
Hi @dpc ,
If you like using SQL:
1. Test data:
# Sample data
data = [("1", "A"), ("1", "A"), ("2", "B"), ("2", "B"), ("3", "C")]
# Create DataFrame
df = spark.createDataFrame(data, ["id", "value"])
# Write to Delta table
df.write.format("delta").mode("overwrite").saveAsTable("duplicates")
2. Removing the duplicates from the table
%sql
CREATE OR REPLACE TEMP VIEW dedups AS
SELECT DISTINCT * FROM duplicates;
-- Overwrite the existing Delta table with distinct rows
INSERT OVERWRITE duplicates
SELECT * FROM dedups;
SELECT *
FROM duplicates;
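Possibly the temp view isn't even necessary: because Delta reads come from a consistent snapshot, a single INSERT OVERWRITE that selects from the same table should be safe, though it's worth verifying on your runtime version. A sketch, assuming the same duplicates table as above:

```sql
-- Rewrite the table with one copy of each row; the SELECT reads the
-- pre-overwrite snapshot, so reading and overwriting the same Delta
-- table in one statement should not conflict.
INSERT OVERWRITE duplicates
SELECT DISTINCT * FROM duplicates;
```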
3. Result: no duplicates remain in the duplicates table.
09-25-2024 09:15 AM
Thanks. This looks more straightforward than I thought.
I'll give it a go in the morning and see if it works.