โ09-23-2024 03:07 AM
Hello
I've seen posts that show how to remove duplicates, something like this:
MERGE into [deltatable] as target
USING ( select *, ROW_NUMBER() OVER (Partition By [primary keys] Order By [date] desc) as rn from [deltatable] qualify rn> 1 ) as source
ON [merge primary keys and date column between source and target]
WHEN MATCHED THEN DELETE
Problem is this is reliant upon the rows having something to order by as a differentiator.
If I use the same column to order then use ON source.<PK> = target.<PK>, it just removes every row (so same as just a straight delete)
My data is an exact copy so I have 7 duplicated rows in the table
This is pretty easy to delete in SQL Server but I cannot find an easy method in databricks
Do I have to maybe delete and compare versions then insert from a previous version with a distinct?
Or create a temp table, insert with distinct then truncate source and copy back?
Both these methods seem long-winded
Any thoughts?
โ09-23-2024 06:59 AM - edited โ09-23-2024 07:00 AM
Please try this pyspark solution.
df = spark.table("yourtablename")
df = df.dropDuplicates()
or use the Row_number() query and use DELETE statement where rn > 1
โ09-23-2024 06:59 AM - edited โ09-23-2024 07:00 AM
Please try this pyspark solution.
df = spark.table("yourtablename")
df = df.dropDuplicates()
or use the Row_number() query and use DELETE statement where rn > 1
โ09-25-2024 09:17 AM
I originally tried this:
Row_number() query and use DELETE statement where rn > 1
It removes all rows no matter how I piece the code together
Found this method works well in SQL Server but not databricks
โ09-23-2024 07:43 AM
Hi @dpc ,
if you like using SQL:
1. Test data:
# Sample data
data = [("1", "A"), ("1", "A"), ("2", "B"), ("2", "B"), ("3", "C")]
# Create DataFrame
df = spark.createDataFrame(data, ["id", "value"])
# Write to Delta table
df.write.format("delta").mode("overwrite").saveAsTable("duplicates")
2. Removing the duplicates from the table
%sql
CREATE OR REPLACE TEMP VIEW dedups AS
SELECT DISTINCT * FROM duplicates;
-- Overwrite the existing Delta table with distinct rows
INSERT OVERWRITE duplicates
SELECT * FROM dedups;
SELECT *
FROM duplicates;
3. No duplicates in duplicates:
โ09-25-2024 09:15 AM
Thanks. This looks more straight forward than I thought
I'll give it a go in the morning and see if it works
Passionate about hosting events and connecting people? Help us grow a vibrant local communityโsign up today to get started!
Sign Up Now