05-16-2023 10:14 AM
That would be helpful. For now, the best approach is to read the table as a DataFrame, call PySpark's dropDuplicates(), and write the result back:
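# Load the table as a DataFrame
df = spark.table("yourtable")
# Keep one row per (Id, Name) combination; which of the duplicate rows survives is not guaranteed
df = df.dropDuplicates(["Id", "Name"])
# Overwrite the original table with the de-duplicated DataFrame
# (reading and overwriting the same table in one job assumes a Delta table)
df.write.mode("overwrite").saveAsTable("yourtable")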
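If it helps to see what de-duplicating on a subset of columns actually does, here is a small plain-Python sketch of the idea. It is not Spark code: the `drop_duplicates` helper, the sample rows, and the `Value` column are made up for illustration, and the "keep the first row seen" rule is an assumption of the sketch, not something Spark promises.

```python
# Plain-Python model of de-duplicating on a subset of columns.
# Hypothetical helper for illustration only; this is not the Spark implementation.

def drop_duplicates(rows, subset):
    """Keep one row per distinct combination of the `subset` columns."""
    seen = set()
    kept = []
    for row in rows:
        # Build the de-duplication key from the chosen columns only
        key = tuple(row[col] for col in subset)
        if key not in seen:
            seen.add(key)
            kept.append(row)  # first row seen for this key is kept (sketch assumption)
    return kept

# Made-up sample data: the first two rows share (Id, Name) but differ in Value
rows = [
    {"Id": 1, "Name": "a", "Value": 10},
    {"Id": 1, "Name": "a", "Value": 20},
    {"Id": 2, "Name": "b", "Value": 30},
]

deduped = drop_duplicates(rows, ["Id", "Name"])
print(len(deduped))
```

Note that the columns outside the subset (here `Value`) can differ between the duplicates, so one of those values is silently discarded. If it matters which row you keep, it is safer to pick it explicitly, for example by ordering within each key, than to rely on whatever row happens to survive.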
My blog: https://databrickster.medium.com/