05-16-2023 06:10 AM
A deltaTable.dropDuplicates(columns) method would be a very nice feature, simplifying the complex procedures that are suggested online.
Or am I missing an existing procedure that works without merge operations or similar?
05-16-2023 02:43 PM
I created a feature request in the delta table project: [Feature Request] data deduplication on existing delta table · Issue #1767 · delta-io/delta (github....
05-16-2023 10:14 AM
It would be helpful. Currently, the simplest way is to read the table as a DataFrame and use PySpark's dropDuplicates():
# Load the Delta table as a DataFrame
df = spark.table("yourtable")
# Drop duplicates based on the Id and Name columns
df = df.dropDuplicates(["Id", "Name"])
# Overwrite the original table with the deduplicated DataFrame
df.write.mode("overwrite").saveAsTable("yourtable")
01-04-2024 07:29 AM
This worked perfectly, and it's much easier than all the complex solutions that are suggested online.
06-23-2024 07:55 AM
Be aware that this is basically wiping and rewriting the whole table. It's a very easy solution, but a very expensive one.
There's a reason the "usual" solutions are so complex: they target only the duplicated rows instead of rewriting everything.
12-16-2024 01:17 PM
Is this still the best method?