11-23-2022 07:05 PM
Hi,
After appending new values to a delta table, I need to delete duplicate rows.
After deleting duplicate rows using PySpark, I overwrite the table (keeping the schema).
My question is, do I have to do ZORDER again?
Another question: is there a different way to drop duplicates? I tried dropping duplicates in SQL with a CTE, but that failed with: Error: Delete is only supported with v2 tables.
# Append new data:
data.write.mode("append").format("delta").saveAsTable(table_name)
# Read table:
df = spark.sql(f"SELECT * FROM {table_name}")
# Drop duplicates:
df = df.dropDuplicates(["col1", "col2"])
# Re-write data (keeping the existing schema):
df.write.format("delta").mode("overwrite").option("overwriteSchema", "false").saveAsTable(table_name)
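One thing to keep in mind with this pattern: an overwrite rewrites all of the table's data files, so any Z-order clustering from before the overwrite is lost. If your queries rely on it, you would re-run OPTIMIZE with ZORDER BY afterwards. A minimal sketch for a Databricks/Spark environment, assuming `col1` is a column your queries commonly filter on (an assumption, not from the original post):

```python
# Hedged sketch: re-cluster the table after the overwrite.
# Overwriting rewrites every data file, so earlier Z-order clustering is gone.
# "col1" is an assumed example filter column; adjust to your workload.
spark.sql(f"OPTIMIZE {table_name} ZORDER BY (col1)")
```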
11-24-2022 01:55 AM
Z-Ordering is not triggered by auto-optimize, so you will have to run it explicitly.
A way to avoid duplicates is to use MERGE instead of append, but it is possible that the runtime will be longer than append + dropDuplicates.
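The MERGE alternative mentioned above can be sketched with the Delta Lake Python API. This is a hedged sketch for a Databricks/Delta environment; `table_name`, `data`, `col1`, and `col2` are the names used earlier in the thread, and the match condition is an assumption you would adapt to your real key columns:

```python
# Hedged sketch: insert only rows whose key is not already in the table,
# instead of appending everything and deduplicating afterwards.
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, table_name)
(
    target.alias("t")
    .merge(
        data.alias("s"),                        # the new batch to load
        "t.col1 = s.col1 AND t.col2 = s.col2",  # assumed key columns
    )
    .whenNotMatchedInsertAll()                  # skip rows already present
    .execute()
)
```

Note that if the incoming batch can itself contain duplicates, you would still call `data.dropDuplicates(["col1", "col2"])` on the batch before merging, since MERGE only deduplicates against the target table.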
11-24-2022 03:25 PM
@Werner Stinckens Thank you!
I am triggering Z-Order after the table is created. But just to be sure: is it required to re-run Z-Order after overwriting the Delta table (with schema overwrite set to false)?
11-25-2022 01:38 AM
Z-Ordering is never required; that also holds if you overwrite a z-ordered table.
The only consequence is that the newly written data is not z-ordered.
11-24-2022 06:40 AM
The best way to avoid duplicates is the MERGE statement:
https://docs.databricks.com/sql/language-manual/delta-merge-into.html
11-27-2022 05:30 AM
Hi @Nurettin Ersoz,
Try using an incremental load of the data, which avoids duplicates; you can run a full load occasionally if your existing data has been updated.