05-16-2023 06:10 AM
A deltaTable.dropDuplicates(columns) would be a very nice feature, simplifying the complex procedures that are suggested online.
Or am I missing an existing procedure that works without merge operations or similar?
- Labels: Feature request
Accepted Solutions
05-16-2023 02:43 PM
I created a feature request in the Delta Lake project: [Feature Request] data deduplication on existing delta table · Issue #1767 · delta-io/delta (github....
05-16-2023 10:14 AM
That would be helpful. Currently, the best way is just to read the table as a DataFrame and use PySpark's dropDuplicates().
# Load the table
df = spark.table("yourtable")
# Drop duplicates based on the Id and Name columns
df = df.dropDuplicates(["Id", "Name"])
# Overwrite the original table with the resulting dataframe
df.write.mode("overwrite").saveAsTable("yourtable")
01-04-2024 07:29 AM
This worked perfectly and was much easier than all the complex solutions that are suggested online.
06-23-2024 07:55 AM
This is basically wiping and rewriting the whole table. It's a very easy solution, but a very expensive one on large tables.
The "usual" solutions are complex for a reason: they only target the duplicated rows instead of rewriting everything.
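One way to judge whether a full rewrite is worth the cost, and a plausible first step for any targeted cleanup, is to list the key combinations that actually occur more than once. The following is only a minimal sketch: the helper name `duplicate_keys_sql` is hypothetical, the table and column names are reused from the earlier reply, and the function just builds a Spark SQL string rather than touching the table.

```python
def duplicate_keys_sql(table, key_cols):
    """Build a Spark SQL query listing key combinations that occur more than once.

    `table` and `key_cols` are placeholders supplied by the caller; this
    hypothetical helper only returns a query string and runs nothing itself.
    """
    if not key_cols:
        raise ValueError("key_cols must not be empty")
    # Backtick-quote column names so reserved words do not break the query.
    cols = ", ".join(f"`{c}`" for c in key_cols)
    return (
        f"SELECT {cols}, COUNT(*) AS copies "
        f"FROM {table} "
        f"GROUP BY {cols} "
        f"HAVING COUNT(*) > 1"
    )


if __name__ == "__main__":
    # Same table and key columns as in the dropDuplicates example above.
    print(duplicate_keys_sql("yourtable", ["Id", "Name"]))
```

In a notebook with an active session, something like `spark.sql(duplicate_keys_sql("yourtable", ["Id", "Name"])).show()` should display the affected keys. If only a handful come back, a targeted approach may pay off; if most of the table is duplicated, the overwrite is probably not much more expensive anyway.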
12-16-2024 01:17 PM
Is this still the best method?