
DLT Primary Key Deduplication: Expectations vs. Constraints vs. Other?

ChristianRRL
Valued Contributor

I'm trying to figure out the best way to de-duplicate data via DLT. Currently, my only leads are:

  • Expectations
  • Primary key constraints

I see problems with both of these options.

First, with "Expectations", the functionality seems a bit limited/cumbersome compared to what I would expect. I understand that I can set "filters" to trigger certain expectations, but since each expectation is evaluated one row at a time, it doesn't seem smart/sophisticated enough to deduplicate based on simple primary keys.
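
For context, this is the kind of expectation I'm referring to (the table and column names here are made up):

import dlt

@dlt.table
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")  # predicate checked per row
def customers_clean():
    # An expectation can drop rows with a NULL key, but it has no way to
    # detect that two different rows share the same primary key value.
    return spark.read.table("customers_raw")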

Second, with "Constraints", the functionality seems closer to what I'm looking for, *however* (1) it doesn't look like it's programmatic at all (e.g. it requires tables to be pre-defined), and (2) it seems to require Unity Catalog, which we don't have enabled in our organization.

Am I missing something? Data deduplication based on primary keys should be a fairly straightforward, industry-standard practice. I would think Databricks/DLT would have a simple/intuitive way to handle this (especially for a company advocating a medallion architecture), but I can't seem to find good leads on deduplicating data via DLT!

2 REPLIES

Palash01
Valued Contributor

Hey @ChristianRRL ,

Based on my understanding, you want to de-duplicate your data during DLT pipeline processing. Unfortunately, when I ran into this problem I wasn't able to find a solution either, due to the limitations of the native features.

Limitations of native features:

  • Delta tables lack a built-in drop-duplicates method: although Delta tables offer functionality like partition pruning and MERGE operations, there isn't a dedicated method for deduplication based on specific columns such as primary keys.
  • Delta Live Tables don't enforce primary key constraints: while Delta Live Tables support defining primary keys, those aren't currently enforced, meaning duplicates can still slip through during inserts (a quick demonstration follows this list).
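
To illustrate the second point, here's a minimal sketch (it assumes a Unity Catalog-enabled workspace, an ambient spark session, and a made-up main.demo schema): the PRIMARY KEY constraint is accepted, but it is informational only, so the duplicate insert succeeds without any error.

spark.sql("""
    CREATE TABLE IF NOT EXISTS main.demo.customers (
        id   BIGINT NOT NULL,
        name STRING,
        CONSTRAINT customers_pk PRIMARY KEY (id)
    )
""")

# Both rows share id = 1, yet the insert succeeds because the constraint
# is informational and not enforced at write time.
spark.sql("INSERT INTO main.demo.customers VALUES (1, 'a'), (1, 'b')")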

Existing approaches:

  • Custom SQL queries: you can achieve deduplication by writing custom SQL using DISTINCT, or window functions such as ROW_NUMBER() over your primary key combination (see the PySpark sketch after this list). While effective, this can be cumbersome and requires manual maintenance.
  • Delta MERGE with ROW_NUMBER: this involves using a Delta MERGE operation with a step that identifies and deletes duplicate rows based on ROW_NUMBER() calculated over the primary key columns. While more concise than pure SQL queries, it's still somewhat complex and requires careful scripting.
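
Here's a rough sketch of the window-function approach in PySpark; the helper name, key columns, and ordering column are placeholders you'd adapt to your schema:

from pyspark.sql import DataFrame, functions as F
from pyspark.sql.window import Window

def dedupe_on_keys(df: DataFrame, keys: list, order_col: str) -> DataFrame:
    # Keep only the newest row per primary-key combination
    w = Window.partitionBy(*keys).orderBy(F.col(order_col).desc())
    return (
        df.withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn")
    )

# e.g. dedupe_on_keys(spark.read.table("raw_customers"), ["id"], "updated_at")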

This being said, I also tried a workaround, though you'll need to test it against your data and requirements. You can read the data into a DataFrame outside of your return statement and then call dropDuplicates() on that DataFrame before (or in) the return; alternatively, you can run a merge statement on your DataFrame before returning it.

 

import dlt

@dlt.table
def customers():
    # Read the raw source (the format and path here are placeholders)
    df = spark.read.format("file_format").load("example_table/parquet/csv")
    # Drop exact-duplicate rows; pass a column list instead, e.g.
    # df.dropDuplicates(["id"]), to deduplicate on primary keys
    return df.dropDuplicates()

 

Or else you can use a Spark job to de-dup using a MERGE statement; more details on MERGE can be found here: https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html

 

MERGE INTO [deltatable] AS target
USING (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY [primary keys] ORDER BY [date] DESC) AS rn
  FROM [deltatable]
  QUALIFY rn > 1
) AS source
ON [merge primary keys and date column between source and target]
WHEN MATCHED THEN DELETE

 

Hope this helps! 

Leave a like if this helps! Kudos,
Palash

ChristianRRL
Valued Contributor

Thank you @Palash01! I hear what you're saying, and I just attended a Databricks office hours session where they also recommended both approaches you suggested.

One quick note on your "Limitations" points:


@Palash01 wrote:

...

Limitations of native features:

  • Delta tables lack a built-in drop-duplicates method: although Delta tables offer functionality like partition pruning and MERGE operations, there isn't a dedicated method for deduplication based on specific columns such as primary keys.
  • Delta Live Tables don't enforce primary key constraints: while Delta Live Tables support defining primary keys, those aren't currently enforced, meaning duplicates can still slip through during inserts.

Both points (particularly the 2nd one) seem like a big oversight from Databricks. Data deduplication and primary key enforcement are very common in the data field, so I would have thought DLT or Auto Loader would have some functionality to support this.

I'm curious whether you (or anyone else) knows if Databricks is working on new features to fill either of these gaps? Or am I missing something big/obvious as to why DLT wouldn't natively support them?
