Hi,
After appending new values to a Delta table, I need to delete the duplicate rows. I currently drop them with PySpark and then overwrite the table (keeping the schema).
My question: after that overwrite, do I have to run OPTIMIZE with ZORDER again?
Another question: is there a better way to drop the duplicates? I tried deleting them in SQL using a CTE, but that failed with the error `Delete is only supported with v2 tables.`
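Roughly what I tried (a sketch of the failing statement; `table_name`, `col1`, and `col2` are placeholders for my real table and key columns):

```sql
-- Rank rows within each (col1, col2) group, then try to delete the extras.
WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1) AS rn
  FROM table_name
)
DELETE FROM ranked WHERE rn > 1
```

The DELETE step is what raises the `Delete is only supported with v2 tables` error.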
# Append new data:
data.write.format("delta").mode("append").saveAsTable(table_name)

# Read the table back:
df = spark.sql(f"SELECT * FROM {table_name}")

# Drop duplicates on the key columns:
df = df.dropDuplicates(["col1", "col2"])

# Overwrite the table, keeping the existing schema:
df.write.format("delta").mode("overwrite").option("overwriteSchema", "false").saveAsTable(table_name)
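To be clear about the behaviour I'm after, here is the dedup logic sketched in plain Python: keep one row per `(col1, col2)` key, as `dropDuplicates` does. The function name, column names, and sample rows are illustrative, not my real data.

```python
def drop_duplicates(rows, keys):
    """Keep the first row seen for each distinct combination of key fields."""
    seen = set()
    result = []
    for row in rows:
        k = tuple(row[key] for key in keys)
        if k not in seen:
            seen.add(k)
            result.append(row)
    return result

rows = [
    {"col1": 1, "col2": "a", "val": 10},
    {"col1": 1, "col2": "a", "val": 99},  # same (col1, col2) key -> dropped
    {"col1": 2, "col2": "b", "val": 20},
]
deduped = drop_duplicates(rows, ["col1", "col2"])
print(len(deduped))  # 2
```

One caveat: in this sketch "first" means iteration order, whereas Spark's `dropDuplicates` makes no guarantee about which of the duplicate rows survives.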