DLT Primary Key Deduplication: Expectations vs. Constraints vs. Other?
01-23-2024 12:48 PM
I'm trying to figure out the best way to "de-duplicate" data via DLT. Currently, my only leads are:
- Manage data quality with Delta Live Tables | Databricks on AWS
- Via "Drop invalid records"
- Constraints on Databricks | Databricks on AWS
- Via "pre-defined" table constraints + UC
I see problems with both of these options.
First, with "Expectations" this functionality seems a bit limited/cumbersome compared to what I would expect. I understand that I can set "filters" to trigger certain expectations, but it doesn't seem smart/sophisticated enough to deduplicate based on simple primary keys.
Second, with "Constraints" this functionality seems closer to what I'm looking for *however* (1) it doesn't look like it's programmatic at all (e.g. requires tables to be pre-defined), and (2) it seems to require Unity Catalog which we don't have enabled in our organization.
So I'd like to ask: am I missing something? Data deduplication based on primary keys should be a fairly straightforward, industry-standard practice. I would think Databricks/DLT should have a simple/intuitive way to handle this (especially for a company advocating a medallion architecture), but I can't seem to find good leads on deduplicating data via DLT!
01-23-2024 10:22 PM
Hey @ChristianRRL ,
Based on my understanding, you want to de-duplicate your data during DLT pipeline processing. Unfortunately, when I ran into this problem I wasn't able to find a solution either, due to the limitations of the native features.
Limitations of native features:
- Delta tables lack a built-in drop-duplicates method: Although Delta tables offer functionality like partition pruning and merge operations, there isn't a dedicated method for deduplication based on specific columns such as primary keys.
- Delta Live Tables don't enforce primary key constraints: While DLT supports defining primary keys, they aren't currently enforced, meaning duplicates can still slip through during inserts (illustrated just after this list).
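To illustrate the second point, a rough sketch with hypothetical names (a customers table with customer_id and name columns); note that declaring the constraint at all already requires Unity Catalog:
-- The column must be NOT NULL before it can be a primary key
ALTER TABLE customers ALTER COLUMN customer_id SET NOT NULL;
-- Declare the PK; on Databricks it is informational, not enforced
ALTER TABLE customers ADD CONSTRAINT pk_customers PRIMARY KEY (customer_id);
-- This insert still succeeds even if customer_id 42 already exists
INSERT INTO customers VALUES (42, 'Duplicate Row');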
Existing approaches:
- Custom SQL queries: You can achieve deduplication by writing custom SQL queries using DISTINCT, or window functions such as ROW_NUMBER over your primary key combination (see the sketch after this list). While effective, this can be cumbersome and requires manual maintenance.
- Delta MERGE with ROW_NUMBER: This involves a Delta MERGE operation with a step that identifies and deletes duplicate rows based on a ROW_NUMBER calculated over the primary key columns. While more concise than pure SQL queries, it's still somewhat complex and requires careful scripting.
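For instance, a minimal sketch of the window-function approach, keeping only the newest row per key (the orders table and the order_id/updated_at columns are hypothetical):
-- Keep only the most recent row for each primary key value
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1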
That being said, I also tried a workaround, though you'll need to test it against your data and requirements. You can read the data into a DataFrame and call dropDuplicates() on it before (or in) your return statement; alternatively, you can run a MERGE statement on your data before returning the DataFrame from the function.
import dlt

@dlt.table
def customers():
    # Replace "file_format" and the path with your actual source
    df = spark.read.format("file_format").load("example_table/parquet/csv")
    # Drops rows that are exact duplicates across all columns
    return df.dropDuplicates()
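Note that dropDuplicates() with no arguments only removes rows that are identical across every column. To de-duplicate on your primary keys specifically, pass them as a subset, e.g. df.dropDuplicates(["customer_id"]) (the column name is just a placeholder).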
Alternatively, you can use a Spark job to de-duplicate with a MERGE statement; more details on MERGE can be found here: https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html
MERGE INTO [deltatable] AS target
USING (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY [primary keys] ORDER BY [date] DESC) AS rn
  FROM [deltatable]
  QUALIFY rn > 1
) AS source
ON [merge primary keys and date column between source and target]
WHEN MATCHED THEN DELETE
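One caveat with this pattern: because of the QUALIFY rn > 1 filter, the source contains only the older duplicate rows, so the ON clause must match exactly those rows in the target (hence matching on the date column as well, not just the primary keys). If duplicates can share identical primary key and date values, you'd need an additional tiebreaker column to avoid deleting the copy you intend to keep.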
Hope this helps!
Palash
01-24-2024 09:47 AM
Thank you @Palash01! I hear what you're saying, plus I just attended a Databricks office hours session and they also recommended both approaches you suggested.
One quick note on your "Limitations" points:
@Palash01 wrote:...
Limitations of native features:
- Delta tables lack a built-in drop-duplicates method: Although Delta tables offer functionality like partition pruning and merge operations, there isn't a dedicated method for deduplication based on specific columns such as primary keys.
- Delta Live Tables don't enforce primary key constraints: While DLT supports defining primary keys, they aren't currently enforced, meaning duplicates can still slip through during inserts.
Both points (particularly the second one) seem like a big oversight from Databricks. Data deduplication and primary key enforcement are very common in the data field, so I would have thought DLT or Auto Loader would have some functionality to support this.
I'm curious whether you (or anyone else) knows if Databricks is working on new features to fill either of these gaps? Or am I missing something big/obvious as to why DLT wouldn't natively support them?

