
Is ZORDER required after table overwrite?

numersoz
New Contributor III

Hi,

After appending new values to a delta table, I need to delete duplicate rows.

After deleting duplicate rows using PySpark, I overwrite the table (keeping the schema).

My question is: do I have to run ZORDER again?

Another question: is there another way to drop duplicates? I tried deleting duplicates using SQL with a CTE, but that didn't work. (Error: Delete is only supported with v2 tables.)

# Append new data:
data.write.mode("append").format("delta").saveAsTable(table_name)

# Read the table back:
df = spark.sql(f"SELECT * FROM {table_name}")

# Drop duplicates:
df = df.dropDuplicates(["col1", "col2"])

# Overwrite the table, keeping the existing schema:
df.write.format("delta").mode("overwrite").option("overwriteSchema", "false").saveAsTable(table_name)

5 REPLIES

-werners-
Esteemed Contributor III

Z-Ordering is not triggered by auto-optimize, so you will have to run it explicitly.
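For example, a minimal sketch of running it explicitly (col1 is a placeholder for whichever column you filter on most often):

# Run Z-ordering explicitly; auto-optimize will not do this for you:
spark.sql(f"OPTIMIZE {table_name} ZORDER BY (col1)")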

A way to avoid duplicates is to use MERGE instead of append, but the runtime may well be longer than the append + dropDuplicates approach.

numersoz
New Contributor III

@Werner Stinckens Thank you!

I am triggering Z-Order after the table is created. But just to be sure: is it required to re-run Z-Order after overwriting the Delta table (with overwriteSchema set to false)?

-werners-
Esteemed Contributor III

Z-ordering is never required; that also applies when you overwrite a z-ordered table.

The only thing that will happen is that the newly written data is not z-ordered.

yogu
Honored Contributor III

The best way to avoid duplicates is a MERGE statement.

https://docs.databricks.com/sql/language-manual/delta-merge-into.html
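A minimal sketch of what that could look like with the Python DeltaTable API, assuming the col1/col2 pair from the original post is the deduplication key:

from delta.tables import DeltaTable

# Insert only rows whose key does not already exist in the target table;
# col1/col2 come from the dropDuplicates call above, adjust to your real key.
target = DeltaTable.forName(spark, table_name)
(target.alias("t")
    .merge(data.alias("s"), "t.col1 = s.col1 AND t.col2 = s.col2")
    .whenNotMatchedInsertAll()
    .execute())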

DeepakMakwana74
New Contributor III

Hi @Nurettin Ersoz

Try using an incremental load of the data so it avoids duplicates, and run a full load occasionally if your data has updates.
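A rough sketch of what that could look like, assuming a hypothetical updated_at timestamp column on the source and a last_load_ts watermark saved from the previous run (source_path is a placeholder too):

from delta.tables import DeltaTable

# Read only rows changed since the last run (updated_at is hypothetical):
new_rows = (spark.read.format("delta").load(source_path)
    .where(f"updated_at > '{last_load_ts}'"))

# Upsert so that re-running the load never creates duplicates:
(DeltaTable.forName(spark, table_name).alias("t")
    .merge(new_rows.alias("s"), "t.col1 = s.col1 AND t.col2 = s.col2")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())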
