Re: Feature request delta tables : drop duplicate ...

Hubert-Dudek · ‎05-16-2023

It would be helpful. Currently, the best way is just to read the table as a DataFrame and use PySpark's dropDuplicates().

# Load the table
df = spark.table("yourtable")
 
# Drop duplicates based on the Id and Name columns
df = df.dropDuplicates(["Id", "Name"])
 
# Overwrite the original table with the resulting dataframe
df.write.mode("overwrite").saveAsTable("yourtable")

My blog: https://databrickster.medium.com/