<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic DLT Primary Key Deduplication: Expectations vs. Constraints vs. Other? in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/dlt-primary-key-deduplication-expectations-vs-constraints-vs/m-p/58290#M2326</link>
    <description>&lt;P&gt;I'm trying to figure out what's the best way to "de-duplicate" data via DLT. Currently, my only leads are:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;A href="https://docs.databricks.com/en/delta-live-tables/expectations.html" target="_blank"&gt;Manage data quality with Delta Live Tables | Databricks on AWS&lt;/A&gt;&lt;UL&gt;&lt;LI&gt;Via "Drop invalid records"&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://docs.databricks.com/en/tables/constraints.html" target="_blank"&gt;Constraints on Databricks | Databricks on AWS&lt;/A&gt;&lt;UL&gt;&lt;LI&gt;Via "pre-defined" table constraints + UC&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I see problems with both of these options.&lt;/P&gt;&lt;P&gt;First, with "&lt;STRONG&gt;Expectations&lt;/STRONG&gt;" this functionality seems a bit limited/cumbersome compared to what I would expect. I understand that I can set "filters" to trigger certain expectations, but it doesn't seem smart/sophisticated enough to deduplicate based on simple primary keys.&lt;/P&gt;&lt;P&gt;Second, with "&lt;STRONG&gt;Constraints&lt;/STRONG&gt;" this functionality seems closer to what I'm looking for *however* (1) it doesn't look like it's programmatic at all (e.g. requires tables to be pre-defined), and (2) it seems to require Unity Catalog which we don't have enabled in our organization.&lt;/P&gt;&lt;P&gt;I'd like to ask, am I missing something? Data deduplication based on primary keys should be a fairly straightforward and industry standard practice. I would think Databricks/DLT should have a simple/intuitive way to handle this (especially for a company advocating for a medallion architecture), but I can't seem to find good leads on deduplicating data via DLT!&lt;/P&gt;</description>
    <pubDate>Tue, 23 Jan 2024 20:48:35 GMT</pubDate>
    <dc:creator>ChristianRRL</dc:creator>
    <dc:date>2024-01-23T20:48:35Z</dc:date>
    <item>
      <title>DLT Primary Key Deduplication: Expectations vs. Constraints vs. Other?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/dlt-primary-key-deduplication-expectations-vs-constraints-vs/m-p/58290#M2326</link>
      <description>&lt;P&gt;I'm trying to figure out what's the best way to "de-duplicate" data via DLT. Currently, my only leads are:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;A href="https://docs.databricks.com/en/delta-live-tables/expectations.html" target="_blank"&gt;Manage data quality with Delta Live Tables | Databricks on AWS&lt;/A&gt;&lt;UL&gt;&lt;LI&gt;Via "Drop invalid records"&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://docs.databricks.com/en/tables/constraints.html" target="_blank"&gt;Constraints on Databricks | Databricks on AWS&lt;/A&gt;&lt;UL&gt;&lt;LI&gt;Via "pre-defined" table constraints + UC&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I see problems with both of these options.&lt;/P&gt;&lt;P&gt;First, with "&lt;STRONG&gt;Expectations&lt;/STRONG&gt;" this functionality seems a bit limited/cumbersome compared to what I would expect. I understand that I can set "filters" to trigger certain expectations, but it doesn't seem smart/sophisticated enough to deduplicate based on simple primary keys.&lt;/P&gt;&lt;P&gt;Second, with "&lt;STRONG&gt;Constraints&lt;/STRONG&gt;" this functionality seems closer to what I'm looking for *however* (1) it doesn't look like it's programmatic at all (e.g. requires tables to be pre-defined), and (2) it seems to require Unity Catalog which we don't have enabled in our organization.&lt;/P&gt;&lt;P&gt;I'd like to ask, am I missing something? Data deduplication based on primary keys should be a fairly straightforward and industry standard practice. I would think Databricks/DLT should have a simple/intuitive way to handle this (especially for a company advocating for a medallion architecture), but I can't seem to find good leads on deduplicating data via DLT!&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jan 2024 20:48:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/dlt-primary-key-deduplication-expectations-vs-constraints-vs/m-p/58290#M2326</guid>
      <dc:creator>ChristianRRL</dc:creator>
      <dc:date>2024-01-23T20:48:35Z</dc:date>
    </item>
    <item>
      <title>Re: DLT Primary Key Deduplication: Expectations vs. Constraints vs. Other?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/dlt-primary-key-deduplication-expectations-vs-constraints-vs/m-p/58304#M2330</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/96188"&gt;@ChristianRRL&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Based on my understanding you want to de-duplicate your data during your DLT pipeline processing unfortunately I was not able to find a solution to this when I ran into this problem due to the native feature limitations.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Limitations of native features:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Delta Table lacks a built-in&lt;SPAN&gt;&amp;nbsp;drop duplicates&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;method:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;Although Delta Tables offer functionalities like partition pruning and merge operations,&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;there isn't a dedicated method for deduplication based on specific columns like primary keys.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;DLT Live Tables don't enforce primary key constraints:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;While DLT Live Tables support defining primary keys,&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;those aren't currently enforced,&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;meaning duplicates can still slip through during inserts.&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Existing approaches:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Custom SQL queries:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;You can achieve deduplication by writing custom SQL queries using&amp;nbsp;&lt;/SPAN&gt;DISTINCT&lt;SPAN&gt;&amp;nbsp;window functions with&amp;nbsp;&lt;/SPAN&gt;ROW_NUMBER&lt;SPAN&gt;&amp;nbsp;based on your primary key combination.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;While effective,&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;this can be cumbersome and requires manual maintenance.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Delta Merge with&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;ROW_NUMBER:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;This involves using Delta Merge operations with a step to identify and delete duplicate rows based on&amp;nbsp;&lt;/SPAN&gt;ROW_NUMBER&lt;SPAN&gt;&amp;nbsp;calculated over the primary key columns.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;While more concise than pure SQL queries,&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;it's still somewhat complex and requires careful scripting.&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;SPAN&gt;this being said I also tried a workaround but you'll need to test it based on your data and requirements. But yes you can read the data in a data frame outside of your return function and then do a dropDuplicate() on that data frame before or in your return function, the other way could be you use a merge statement on your data frame before returning the data frame through the return function.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;@dlt.table
def customers():
  df = spark.read.format("file_format").load("example_table/parquet/csv")
  return (
  df.dropDuplicates()
)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;or else you can use a spark job to de-dup using a merge statement more details on merge can be found here: &lt;A href="https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html" target="_blank" rel="noopener"&gt;https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;MERGE into [deltatable] as target
USING ( select *, ROW_NUMBER() OVER (Partition By [primary keys] Order By [date] desc) as rn  from [deltatable]) t1 qualify rn&amp;gt; 1 ) as source
ON [merge primary keys and date column between source and target]
WHEN MATCHED THEN DELETE&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hope this helps!&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 24 Jan 2024 06:22:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/dlt-primary-key-deduplication-expectations-vs-constraints-vs/m-p/58304#M2330</guid>
      <dc:creator>Palash01</dc:creator>
      <dc:date>2024-01-24T06:22:29Z</dc:date>
    </item>
    <item>
      <title>Re: DLT Primary Key Deduplication: Expectations vs. Constraints vs. Other?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/dlt-primary-key-deduplication-expectations-vs-constraints-vs/m-p/58348#M2335</link>
      <description>&lt;P&gt;Thank you&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/98703"&gt;@Palash01&lt;/a&gt;! I hear what you're saying, plus I just attended a Databricks office hours session and they also recommended both approaches you suggested.&lt;/P&gt;&lt;P&gt;One quick note on your "Limitations" points:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/98703"&gt;@Palash01&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;P&gt;...&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Limitations of native features:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Delta Table lacks a built-in&lt;SPAN&gt;&amp;nbsp;drop duplicates&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;method:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;Although Delta Tables offer functionalities like partition pruning and merge operations,&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;there isn't a dedicated method for deduplication based on specific columns like primary keys.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT color="#800000"&gt;&lt;STRONG&gt;DLT Live Tables don't enforce primary key constraints:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;While DLT Live Tables support defining primary keys,&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;those aren't currently enforced,&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;meaning duplicates can still slip through during inserts.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Both points (particularly the 2nd one) seem like a big oversight from Databricks. Data deduplication and primary key enforcement are very common in the data field so I would have thought DLT or Autoloader should have some functionality to support this.&lt;/P&gt;&lt;P&gt;I'm curious if you/anyone would know if Databricks is working on new features supporting either of these missing native features? Or am I missing something big/obvious as to why DLT wouldn't natively support either of these features?&lt;/P&gt;</description>
      <pubDate>Wed, 24 Jan 2024 17:47:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/dlt-primary-key-deduplication-expectations-vs-constraints-vs/m-p/58348#M2335</guid>
      <dc:creator>ChristianRRL</dc:creator>
      <dc:date>2024-01-24T17:47:44Z</dc:date>
    </item>
  </channel>
</rss>

