DLT Primary Key Deduplication: Expectations vs. Constraints vs. Other?
01-23-2024 12:48 PM
I'm trying to figure out the best way to "de-duplicate" data via DLT. Currently, my only leads are:
- Manage data quality with Delta Live Tables | Databricks on AWS
- Via "Drop invalid records"
- Constraints on Databricks | Databricks on AWS
- Via "pre-defined" table constraints + UC
I see problems with both of these options.
First, with "Expectations" this functionality seems a bit limited/cumbersome compared to what I would expect. I understand that I can set "filters" to trigger certain expectations, but it doesn't seem smart/sophisticated enough to deduplicate based on simple primary keys.
Second, with "Constraints" this functionality seems closer to what I'm looking for *however* (1) it doesn't look like it's programmatic at all (e.g. requires tables to be pre-defined), and (2) it seems to require Unity Catalog which we don't have enabled in our organization.
So I'd like to ask: am I missing something? Data deduplication based on primary keys should be a fairly straightforward, industry-standard practice. I would think Databricks/DLT should have a simple/intuitive way to handle this (especially for a company advocating a medallion architecture), but I can't seem to find good leads on deduplicating data via DLT!
01-23-2024 10:22 PM
Hey @ChristianRRL ,
Based on my understanding, you want to de-duplicate your data during DLT pipeline processing. Unfortunately, when I ran into this problem I wasn't able to find a solution either, due to the limitations of the native features.
Limitations of native features:
- Delta tables lack a built-in drop-duplicates method: Although Delta tables offer functionality like partition pruning and merge operations, there isn't a dedicated method for deduplication based on specific columns such as primary keys.
- Delta Live Tables don't enforce primary key constraints: While DLT supports defining primary keys, they aren't currently enforced, meaning duplicates can still slip through during inserts (illustrated just after this list).
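To illustrate the second point, a rough sketch with hypothetical names (a customers table with customer_id and name columns); note that declaring the constraint at all already requires Unity Catalog:
-- The column must be NOT NULL before it can be a primary key
ALTER TABLE customers ALTER COLUMN customer_id SET NOT NULL;
-- Declare the PK; on Databricks it is informational, not enforced
ALTER TABLE customers ADD CONSTRAINT pk_customers PRIMARY KEY (customer_id);
-- This insert still succeeds even if customer_id 42 already exists
INSERT INTO customers VALUES (42, 'Duplicate Row');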
Existing approaches:
- Custom SQL queries: You can achieve deduplication by writing custom SQL queries using DISTINCT, or window functions such as ROW_NUMBER over your primary key combination (see the sketch after this list). While effective, this can be cumbersome and requires manual maintenance.
- Delta MERGE with ROW_NUMBER: This involves a Delta MERGE operation with a step that identifies and deletes duplicate rows based on a ROW_NUMBER calculated over the primary key columns. While more concise than pure SQL queries, it's still somewhat complex and requires careful scripting.
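For instance, a minimal sketch of the window-function approach, keeping only the newest row per key (the orders table and the order_id/updated_at columns are hypothetical):
-- Keep only the most recent row for each primary key value
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1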
That being said, I also tried a workaround, though you'll need to test it against your data and requirements. You can read the data into a DataFrame and call dropDuplicates() on it before (or in) your return statement; alternatively, you can run a MERGE statement on your data before returning the DataFrame from the function.
import dlt

@dlt.table
def customers():
    # Replace "file_format" and the path with your actual source
    df = spark.read.format("file_format").load("example_table/parquet/csv")
    # Drops rows that are exact duplicates across all columns
    return df.dropDuplicates()
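Note that dropDuplicates() with no arguments only removes rows that are identical across every column. To de-duplicate on your primary keys specifically, pass them as a subset, e.g. df.dropDuplicates(["customer_id"]) (the column name is just a placeholder).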
Alternatively, you can use a Spark job to de-duplicate with a MERGE statement; more details on MERGE can be found here: https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html
MERGE INTO [deltatable] AS target
USING (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY [primary keys] ORDER BY [date] DESC) AS rn
  FROM [deltatable]
  QUALIFY rn > 1
) AS source
ON [merge primary keys and date column between source and target]
WHEN MATCHED THEN DELETE
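One caveat with this pattern: because of the QUALIFY rn > 1 filter, the source contains only the older duplicate rows, so the ON clause must match exactly those rows in the target (hence matching on the date column as well, not just the primary keys). If duplicates can share identical primary key and date values, you'd need an additional tiebreaker column to avoid deleting the copy you intend to keep.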
Hope this helps!
Palash
01-24-2024 09:47 AM
Thank you @Palash01! I hear what you're saying, plus I just attended a Databricks office hours session and they also recommended both approaches you suggested.
One quick note on your "Limitations" points:
@Palash01 wrote:...
Limitations of native features:
- Delta tables lack a built-in drop-duplicates method: Although Delta tables offer functionality like partition pruning and merge operations, there isn't a dedicated method for deduplication based on specific columns such as primary keys.
- Delta Live Tables don't enforce primary key constraints: While DLT supports defining primary keys, they aren't currently enforced, meaning duplicates can still slip through during inserts.
Both points (particularly the second one) seem like a big oversight from Databricks. Data deduplication and primary key enforcement are very common in the data field, so I would have thought DLT or Auto Loader would have some functionality to support this.
I'm curious whether you (or anyone else) knows if Databricks is working on new features to fill either of these gaps? Or am I missing something big/obvious as to why DLT wouldn't natively support them?

