<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How to remove duplicates in a Delta table? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-remove-duplicates-in-a-delta-table/m-p/4164#M964</link>
    <description>&lt;P&gt;I made multiple inserts (by error) in a Delta table and I have now strict duplicates, I feel like it's impossible to delete them if you don't have a column "IDENTITY" to distinguish lines (the primary key is RLOC+LOAD_DATE):&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="snap delete"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/183iD6CBE47CDD8E6E44/image-size/large?v=v2&amp;amp;px=999" role="button" title="snap delete" alt="snap delete" /&gt;&lt;/span&gt;it sounds odd to me not to be able to do any anything after data definition for removing strict identical rows. I wonder if there was still a way to distinguish them by physical address or insertion date?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.sqlshack.com/different-ways-to-sql-delete-duplicate-rows-from-a-sql-table/" alt="https://www.sqlshack.com/different-ways-to-sql-delete-duplicate-rows-from-a-sql-table/" target="_blank"&gt;https://www.sqlshack.com/different-ways-to-sql-delete-duplicate-rows-from-a-sql-table/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Otherwise I will do that:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="snap identity"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/187i0AF0160BA35B45B4/image-size/large?v=v2&amp;amp;px=999" role="button" title="snap identity" alt="snap identity" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 19 May 2023 08:48:04 GMT</pubDate>
    <dc:creator>Thor</dc:creator>
    <dc:date>2023-05-19T08:48:04Z</dc:date>
    <item>
      <title>How to remove duplicates in a Delta table?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-remove-duplicates-in-a-delta-table/m-p/4164#M964</link>
      <description>&lt;P&gt;I made multiple inserts (by error) in a Delta table and I have now strict duplicates, I feel like it's impossible to delete them if you don't have a column "IDENTITY" to distinguish lines (the primary key is RLOC+LOAD_DATE):&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="snap delete"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/183iD6CBE47CDD8E6E44/image-size/large?v=v2&amp;amp;px=999" role="button" title="snap delete" alt="snap delete" /&gt;&lt;/span&gt;it sounds odd to me not to be able to do any anything after data definition for removing strict identical rows. I wonder if there was still a way to distinguish them by physical address or insertion date?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.sqlshack.com/different-ways-to-sql-delete-duplicate-rows-from-a-sql-table/" alt="https://www.sqlshack.com/different-ways-to-sql-delete-duplicate-rows-from-a-sql-table/" target="_blank"&gt;https://www.sqlshack.com/different-ways-to-sql-delete-duplicate-rows-from-a-sql-table/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Otherwise I will do that:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="snap identity"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/187i0AF0160BA35B45B4/image-size/large?v=v2&amp;amp;px=999" role="button" title="snap identity" alt="snap identity" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 19 May 2023 08:48:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-remove-duplicates-in-a-delta-table/m-p/4164#M964</guid>
      <dc:creator>Thor</dc:creator>
      <dc:date>2023-05-19T08:48:04Z</dc:date>
    </item>
    <item>
      <title>Re: How to remove duplicates in a Delta table?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-remove-duplicates-in-a-delta-table/m-p/4165#M965</link>
      <description>&lt;P&gt;maybe the information I need stands in the value '_metadata.row_index'&lt;/P&gt;</description>
      <pubDate>Mon, 22 May 2023 09:42:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-remove-duplicates-in-a-delta-table/m-p/4165#M965</guid>
      <dc:creator>Thor</dc:creator>
      <dc:date>2023-05-22T09:42:22Z</dc:date>
    </item>
    <item>
      <title>Re: How to remove duplicates in a Delta table?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-remove-duplicates-in-a-delta-table/m-p/37310#M26318</link>
      <description>&lt;P&gt;I am also facing the same issue. The cte query is not working (getting error). I am trying to delete duplicate records from delta table (keeping single record)&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jul 2023 13:34:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-remove-duplicates-in-a-delta-table/m-p/37310#M26318</guid>
      <dc:creator>adeshpande</dc:creator>
      <dc:date>2023-07-10T13:34:53Z</dc:date>
    </item>
    <item>
      <title>Re: How to remove duplicates in a Delta table?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-remove-duplicates-in-a-delta-table/m-p/39383#M26949</link>
      <description>&lt;P&gt;There are several great ways to handle this:&amp;nbsp; &lt;A href="https://stackoverflow.com/questions/61674476/how-to-drop-duplicates-in-delta-table" target="_blank" rel="noopener"&gt;https://stackoverflow.com/questions/61674476/how-to-drop-duplicates-in-delta-table&lt;/A&gt;&lt;/P&gt;&lt;P&gt;This was my preference:&amp;nbsp;&lt;/P&gt;&lt;P&gt;with cte as(&lt;BR /&gt;Select col1,col2,col3,etc&lt;BR /&gt;,row_number()over(partition by col1,col2,col3,etc order by col1)rowno&lt;BR /&gt;from table)&lt;BR /&gt;Delete from cte where rowno&amp;gt;1&lt;/P&gt;&lt;P&gt;but since you stated you are having issues with cte, you can use the merge option also in the link above.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Aug 2023 16:00:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-remove-duplicates-in-a-delta-table/m-p/39383#M26949</guid>
      <dc:creator>Ken_H</dc:creator>
      <dc:date>2023-08-08T16:00:29Z</dc:date>
    </item>
  </channel>
</rss>

