Databricks Community

MRTN · ‎05-16-2023

A deltaTable.dropDuplicates(columns) would be a very nice feature, simplifying the complex procedures that are suggested online.

Or am I missing an existing approach that works without merge operations or similar?

MRTN · ‎05-16-2023

I created a feature request in the Delta Lake project: [Feature Request] data deduplication on existing delta table · Issue #1767 · delta-io/delta (github....


Hubert-Dudek · ‎05-16-2023

It would be helpful. Currently, the simplest way is to read the table as a DataFrame, call PySpark's dropDuplicates(), and write the result back.

# Load the table
df = spark.table("yourtable")
 
# Drop duplicates based on the Id and Name columns
df = df.dropDuplicates(["Id", "Name"])
 
# Overwrite the original table with the resulting dataframe
df.write.mode("overwrite").saveAsTable("yourtable")
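
To make clear what deduplicating on a subset of columns means, here is a small plain-Python sketch. It is not Spark code, and the helper name drop_duplicates and the sample rows are made up for illustration. It keeps the first row seen for each (Id, Name) combination; Spark, as far as I know, does not promise which of the duplicate rows survives, so treat the "first row" behaviour here as an assumption of the sketch only.

```python
# Plain-Python illustration only; this is not Spark code.
# drop_duplicates and the sample rows are hypothetical.

def drop_duplicates(rows, subset):
    """Keep the first row seen for each distinct combination of `subset` values."""
    seen = set()
    kept = []
    for row in rows:
        # Build the deduplication key from the chosen columns only
        key = tuple(row[col] for col in subset)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"Id": 1, "Name": "a", "Value": 10},
    {"Id": 1, "Name": "a", "Value": 20},  # same (Id, Name) as the first row
    {"Id": 2, "Name": "b", "Value": 30},
]

deduped = drop_duplicates(rows, ["Id", "Name"])
print(deduped)
```

Columns outside the subset (Value here) play no part in deciding what counts as a duplicate, so if it matters which row is kept, that has to be decided explicitly before deduplicating.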

My blog: https://databrickster.medium.com/