Hey @ChristianRRL ,
Based on my understanding, you want to de-duplicate your data during DLT pipeline processing. Unfortunately, when I ran into this problem myself I wasn't able to find a clean solution, due to the native feature limitations below.
Limitations of native features:
- Delta tables lack a built-in drop-duplicates method: although Delta tables offer functionality like partition pruning and MERGE operations, there is no dedicated method for deduplicating on specific columns such as primary keys.
- Delta Live Tables don't enforce primary key constraints: while DLT supports declaring primary keys, those aren't currently enforced, meaning duplicates can still slip through during inserts (see the sketch after this list).
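To make that second limitation concrete, here is a minimal sketch (the table pk_demo and its columns are hypothetical, and this assumes a Unity Catalog table, where PRIMARY KEY constraints are informational only):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical demo table; a PRIMARY KEY column must be declared NOT NULL.
spark.sql("""
    CREATE TABLE IF NOT EXISTS pk_demo (
        id INT NOT NULL,
        name STRING,
        CONSTRAINT pk_demo_pk PRIMARY KEY (id)
    )
""")

# Both inserts succeed even though they share the same key value,
# because the constraint is informational rather than enforced.
spark.sql("INSERT INTO pk_demo VALUES (1, 'first')")
spark.sql("INSERT INTO pk_demo VALUES (1, 'duplicate')")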
Existing approaches:
- Custom SQL queries: you can deduplicate by writing custom SQL that uses DISTINCT, or a ROW_NUMBER() window function partitioned by your primary-key combination (see the sketch after this list). While effective, this can be cumbersome and requires manual maintenance.
- Delta MERGE with ROW_NUMBER: this uses a Delta MERGE operation with a step that identifies and deletes duplicate rows based on a ROW_NUMBER() computed over the primary-key columns (a full example appears further below). While more concise than pure SQL queries, it's still somewhat complex and requires careful scripting.
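To illustrate the first approach, here is a minimal PySpark sketch of the ROW_NUMBER pattern (the table example_table and the columns id and updated_at are hypothetical stand-ins for your primary key and ordering column):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical source table and columns -- substitute your own.
df = spark.read.table("example_table")

# Number the rows within each primary-key group, newest first.
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())

deduped = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)  # keep only the latest row per key
      .drop("rn")
)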
That being said, I also tried a workaround, but you'll need to test it against your data and requirements. You can read the data into a DataFrame outside of your return statement and then call dropDuplicates() on that DataFrame before (or in) the return; alternatively, you can run a MERGE statement on the DataFrame before returning it.
import dlt

@dlt.table
def customers():
    # "file_format" and the path below are placeholders -- use your actual source.
    df = spark.read.format("file_format").load("example_table/parquet/csv")
    return df.dropDuplicates()
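One caveat with this sketch: dropDuplicates() with no arguments only removes rows that are identical across every column. If rows can share your primary key while differing elsewhere, pass the key columns explicitly, e.g. df.dropDuplicates(["id"]) (with "id" standing in for your key); that keeps an arbitrary row from each group, so test it against your requirements.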
Or else, you can use a Spark job to de-duplicate via a MERGE statement; more details on MERGE can be found here: https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html
MERGE INTO [deltatable] AS target
USING (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY [primary keys] ORDER BY [date] DESC) AS rn
    FROM [deltatable]
    QUALIFY rn > 1
) AS source
ON [merge primary keys and date column between source and target]
WHEN MATCHED THEN DELETE
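If you'd rather drive the same delete from Python, a rough equivalent with the DeltaTable API could look like the sketch below (again, example_table, id, and updated_at are hypothetical placeholders). Note the assumption that duplicates differ in the date column; exact full-row duplicates would be better handled with dropDuplicates().

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Rows ranked 2nd or later within each key group are the duplicates to drop.
dups = spark.sql("""
    SELECT *
    FROM example_table
    QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) > 1
""")

target = DeltaTable.forName(spark, "example_table")

# Delete the duplicate rows from the target, keeping the newest row per key.
(target.alias("t")
       .merge(dups.alias("s"), "t.id = s.id AND t.updated_at = s.updated_at")
       .whenMatchedDelete()
       .execute())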
Hope this helps! Leave a like if it does. Kudos,
Palash