05-16-2023 06:10 AM
A deltaTable.dropDuplicates(columns) method would be a very nice feature, simplifying the complex procedures that are suggested online.
Or am I missing an existing procedure that can be done without merge operations or similar?
05-16-2023 02:43 PM
I created a feature request in the delta table project: [Feature Request] data deduplication on existing delta table · Issue #1767 · delta-io/delta (github....
05-16-2023 10:14 AM
It would be helpful. Currently, the simplest way is to read the table as a DataFrame and use PySpark's dropDuplicates().
# Load the table
df = spark.table("yourtable")
# Drop duplicates based on the Id and Name columns
df = df.dropDuplicates(["Id", "Name"])
# Overwrite the original table with the deduplicated DataFrame
df.write.mode("overwrite").saveAsTable("yourtable")
01-04-2024 07:29 AM
This worked perfectly, and it was much easier than all the complex solutions suggested online.
06-23-2024 07:55 AM
This is basically wiping and rewriting the whole table. It's obviously a very easy solution, but a very expensive one.
There's a reason the "usual" solutions are so complex: they target only the duplicated rows.
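For context, those "targeted" solutions usually rank the rows within each group of duplicate keys (for example with ROW_NUMBER() over a window partitioned by the key columns) and then delete every row ranked above 1, so only the files containing duplicates get rewritten. Here is a minimal plain-Python sketch of that keep-first-per-key logic; the function name and sample data are illustrative, not Spark API:

```python
# Plain-Python illustration of the "keep the first row per key" logic that
# the window-function dedup applies inside Spark. Sample data is hypothetical.
def dedup_keep_first(rows, key_cols):
    """Keep only the first row seen for each combination of key columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

rows = [
    {"Id": 1, "Name": "a", "value": 10},
    {"Id": 1, "Name": "a", "value": 11},  # duplicate (Id, Name) key
    {"Id": 2, "Name": "b", "value": 12},
]
deduped = dedup_keep_first(rows, ["Id", "Name"])
# deduped keeps one row per (Id, Name) pair
```

In Spark, the same effect comes from filtering on the window rank and deleting (or merging away) only the ranked-above-1 rows, which is why those approaches can be much cheaper than a full overwrite on large tables with few duplicates.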
a week ago
Is this still the best method?