05-16-2023 06:10 AM
A deltaTable.dropDuplicates(columns) would be a very nice feature, simplifying the complex procedures that are suggested online.
Or am I missing an existing procedure that works without merge operations or similar?
05-16-2023 02:43 PM
I created a feature request in the delta table project: [Feature Request] data deduplication on existing delta table · Issue #1767 · delta-io/delta (github....
05-16-2023 10:14 AM
It would be helpful. Currently, the simplest way is to read the table as a DataFrame and use PySpark's dropDuplicates().
# Load the table
df = spark.table("yourtable")
# Drop duplicates based on the Id and Name columns
df = df.dropDuplicates(["Id", "Name"])
# Overwrite the original table with the resulting dataframe
df.write.mode("overwrite").saveAsTable("yourtable")
01-04-2024 07:29 AM
This worked perfectly, and much easier than all the complex solutions that are suggested online.
06-23-2024 07:55 AM
This basically wipes and rewrites the whole table. It's obviously a very easy solution, but a very expensive one.
The "usual" solutions are complex for a reason: they only target the duplicated rows.
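One way to limit that cost is to check whether any duplicates exist at all before rewriting anything. Here is a minimal sketch, not an official Delta or Databricks API: the helper name `duplicate_keys_sql` and the `copies` alias are made up for illustration, and the table and column names are the placeholders from the snippet above. It only builds a plain Spark SQL string, so it runs without a cluster.

```python
def duplicate_keys_sql(table, key_columns):
    """Build a Spark SQL query listing key combinations that occur more than once.

    Hypothetical helper for illustration; it only assembles a string.
    """
    if not key_columns:
        raise ValueError("key_columns must not be empty")
    # Backtick-quote column names so names with spaces or reserved words still parse.
    cols = ", ".join(f"`{c}`" for c in key_columns)
    return (
        f"SELECT {cols}, COUNT(*) AS copies "
        f"FROM {table} GROUP BY {cols} HAVING COUNT(*) > 1"
    )


query = duplicate_keys_sql("yourtable", ["Id", "Name"])
print(query)

# In a notebook with a SparkSession named `spark` (assumed, not created here):
# dupes = spark.sql(query)
# if dupes.limit(1).count() == 0:
#     print("No duplicates, nothing to rewrite")
```

If the query returns no rows, the overwrite can be skipped entirely. If it does return rows, it also shows how many rows are affected, which helps decide whether a full rewrite or a targeted merge is worth the effort.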
a month ago
Is this still the best method?