Feature request delta tables : drop duplicate rows

MRTN — Tue, 16 May 2023 13:10:35 GMT

A deltaTable.dropDuplicates(columns) would be a very nice feature, simplifying the complex procedures that are suggested online.

Or am I missing an existing procedure that works without merge operations or similar?

Re: Feature request delta tables : drop duplicate rows

Hubert-Dudek — Tue, 16 May 2023 17:14:34 GMT

It would be helpful. Currently, the simplest way is to read the table as a DataFrame and use PySpark's dropDuplicates().

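from pyspark.sql import SparkSession

# Reuse the active session (Databricks notebooks already provide `spark`)
spark = SparkSession.builder.getOrCreate()

# Load the table (assumed here to be a Delta table)
df = spark.table("yourtable")

# Drop duplicates based on the Id and Name columns
df = df.dropDuplicates(["Id", "Name"])

# Overwrite the original table with the resulting dataframe
df.write.mode("overwrite").saveAsTable("yourtable")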
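
Before overwriting, it can be worth checking which keys are actually duplicated. Here is a minimal sketch, assuming the same placeholder table and columns as above. `duplicate_keys_sql` is a hypothetical helper, not part of PySpark or Delta Lake, and it only builds the query text:

```python
def duplicate_keys_sql(table, key_columns):
    """Build a query listing each key combination that occurs more than once.

    Hypothetical helper for illustration; it returns SQL text only.
    """
    if not key_columns:
        raise ValueError("key_columns must not be empty")
    # Backticks quote the column names in Spark SQL.
    cols = ", ".join(f"`{c}`" for c in key_columns)
    return (
        f"SELECT {cols}, COUNT(*) AS row_count "
        f"FROM {table} "
        f"GROUP BY {cols} "
        f"HAVING COUNT(*) > 1"
    )


query = duplicate_keys_sql("yourtable", ["Id", "Name"])
print(query)
# In a Spark session the result could be inspected with: spark.sql(query).show()
```

If the query returns no rows, there is nothing to deduplicate and the overwrite can be skipped.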

Re: Feature request delta tables : drop duplicate rows

MRTN — Tue, 16 May 2023 21:43:21 GMT

I created a feature request in the Delta Lake project: [Feature Request] data deduplication on existing delta table · Issue #1767 · delta-io/delta (github.com)

Re: Feature request delta tables : drop duplicate rows

DRGutierrez — Thu, 04 Jan 2024 15:29:42 GMT

This worked perfectly, and it is much easier than all the complex solutions suggested online.

Re: Feature request delta tables : drop duplicate rows

Victor_D — Sun, 23 Jun 2024 14:55:27 GMT

This is basically wiping and rewriting the whole table. It's a very easy solution, but a very expensive one on large tables.

The "usual" solutions are complex for a reason: they only target the duplicated rows.
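
A rough sketch of what a targeted approach might look like with the delta-spark Python API. It rests on several assumptions: `merge_condition`, `dedupe_in_place`, and the staging table name are hypothetical names made up for illustration. The key columns are assumed to contain no NULLs, so rows with NULL keys are left untouched. The delete and the append are two separate commits, so the operation is not atomic.

```python
def merge_condition(keys, target="t", source="d"):
    """Build the equality condition that pairs target rows with staged rows."""
    if not keys:
        raise ValueError("keys must not be empty")
    return " AND ".join(f"{target}.`{k}` = {source}.`{k}`" for k in keys)


def dedupe_in_place(spark, table_name, keys, staging_name):
    """Hypothetical helper: rewrite only the rows whose keys are duplicated.

    Assumes table_name is a Delta table and delta-spark is installed.
    Not atomic: the delete and the append are separate commits.
    """
    # Imported lazily so the pure helper above works without Spark installed.
    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    df = spark.table(table_name)

    # Keys that occur more than once.
    dup_keys = (
        df.groupBy(*keys).count().filter(F.col("count") > 1).select(*keys)
    )

    # One surviving row per duplicated key, in the table's column order.
    survivors = (
        df.join(dup_keys, on=keys, how="inner")
        .dropDuplicates(keys)
        .select(*df.columns)
    )

    # Materialize the survivors first; a lazy DataFrame would be re-read
    # from the table after the delete below and come back empty.
    survivors.write.format("delta").mode("overwrite").saveAsTable(staging_name)
    staged = spark.table(staging_name)

    # Delete every copy of the duplicated keys ...
    (
        DeltaTable.forName(spark, table_name)
        .alias("t")
        .merge(staged.alias("d"), merge_condition(keys))
        .whenMatchedDelete()
        .execute()
    )

    # ... then put exactly one copy of each back.
    staged.write.format("delta").mode("append").saveAsTable(table_name)
    spark.sql(f"DROP TABLE IF EXISTS {staging_name}")


condition = merge_condition(["Id", "Name"])
print(condition)
```

It could be called as dedupe_in_place(spark, "yourtable", ["Id", "Name"], "yourtable_dedup_staging"). Whether this is cheaper than a full overwrite depends on how many data files contain duplicated keys, since only those files should need rewriting.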

Re: Feature request delta tables : drop duplicate rows

akshaybhan92 — Mon, 16 Dec 2024 21:17:56 GMT

Is this still the best method?