Databricks Community

numersoz · ‎11-23-2022

Hi,

After appending new values to a delta table, I need to delete duplicate rows.

After deleting duplicate rows using PySpark, I overwrite the table (keeping the schema).

My question is, do I have to do ZORDER again?

Another question, is there another way to drop duplicates? I tried drop duplicates using SQL with CTE but that didn't work. (Error: Delete is only supported with v2 tables.)

# Append new data:
data.write.mode("append").format("delta").saveAsTable("table_name")
 
# Read table:
df = spark.sql(f"SELECT * FROM {table_name}")
# Drop Duplicates:
df = df.dropDuplicates(["col1", "col2"])
# Re-write data:
df.write.format("delta").mode("overwrite").option("overwriteSchema", "false").saveAsTable(f"{table_name}")

-werners- · ‎11-24-2022

Z-Ordering is not triggered by auto-optimize. So you will have to specifically run it.

A way to avoid dups is to use merge instead of append. But it is possible that the runtime will be bigger than the append + dropdups.

numersoz · ‎11-24-2022

@Werner Stinckens Thank you!

I am triggering Z-Order after table is created. But to be sure, if it is it required to re-do Z-Order after doing an overwrite to Delta Table (with schema overwrite set to false).

-werners- · ‎11-25-2022

Z-ordering is never required, that also counts if you overwrite a z-ordered table.

The only thing that will happen is that the new data which is written is not z-ordered.

yogu · ‎11-24-2022

the best way to avoid duplicates is the merge statements.

https://docs.databricks.com/sql/language-manual/delta-merge-into.html

DeepakMakwana74 · ‎11-27-2022

Hii @Nurettin Ersoz

try to use incremental load of data so it will avoid duplicate and you can use full load once if you have updation in your data

Databricks Community

Is ZORDER required after table overwrite?

Connect with Databricks Users in Your Area

Databricks Named a Leader in the 2024 Gartner® Magic Quadrant™ for Cloud Database Management Systems

Announcing the new Meta Llama 3.3 model on Databricks

Milestone: DatabricksTV Reaches 100 Videos!

Dotmatics and Databricks Partner to Advance Scientific Intelligence in Life Sciences

Databricks Community Champion - December 2024 - Sujesh Menon