Is ZORDER required after table overwrite?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-23-2022 07:05 PM
Hi,
After appending new values to a delta table, I need to delete duplicate rows.
After deleting duplicate rows using PySpark, I overwrite the table (keeping the schema).
My question is, do I have to do ZORDER again?
Another question, is there another way to drop duplicates? I tried drop duplicates using SQL with CTE but that didn't work. (Error: Delete is only supported with v2 tables.)
# Append new data:
data.write.mode("append").format("delta").saveAsTable("table_name")
# Read table:
df = spark.sql(f"SELECT * FROM {table_name}")
# Drop Duplicates:
df = df.dropDuplicates(["col1", "col2"])
# Re-write data:
df.write.format("delta").mode("overwrite").option("overwriteSchema", "false").saveAsTable(f"{table_name}")
- Labels:
-
Duplicate Rows
-
Table
-
Zorder
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-24-2022 01:55 AM
Z-Ordering is not triggered by auto-optimize. So you will have to specifically run it.
A way to avoid dups is to use merge instead of append. But it is possible that the runtime will be bigger than the append + dropdups.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-24-2022 03:25 PM
@Werner Stinckens Thank you!
I am triggering Z-Order after table is created. But to be sure, if it is it required to re-do Z-Order after doing an overwrite to Delta Table (with schema overwrite set to false).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-25-2022 01:38 AM
Z-ordering is never required, that also counts if you overwrite a z-ordered table.
The only thing that will happen is that the new data which is written is not z-ordered.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-24-2022 06:40 AM
the best way to avoid duplicates is the merge statements.
https://docs.databricks.com/sql/language-manual/delta-merge-into.html
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-27-2022 05:30 AM
Hii @Nurettin Ersoz
try to use incremental load of data so it will avoid duplicate and you can use full load once if you have updation in your data