Databricks Community forum thread (Data Engineering)
Topic: How to prevent duplicate entries from entering the Delta Lake on Azure Storage
https://community.databricks.com/t5/data-engineering/how-to-prevent-duplicate-entries-to-enter-to-delta-lake-of-azure/m-p/24948#M17365

Original post by User16826994223, 11 Jun 2021, 15:27 GMT
https://community.databricks.com/t5/data-engineering/how-to-prevent-duplicate-entries-to-enter-to-delta-lake-of-azure/m-p/24946#M17363

I have a DataFrame stored in Delta format in ADLS. When I append new, updated rows to that Delta table, duplicates are created. Is there any way to delete the old existing record in Delta and add the new, updated record instead? The schema of the DataFrame stored in Delta has a unique column, by which we can check whether a record is updated or new.

Reply by Ryan_Chynoweth, 18 Jun 2021, 21:44 GMT
https://community.databricks.com/t5/data-engineering/how-to-prevent-duplicate-entries-to-enter-to-delta-lake-of-azure/m-p/24947#M17364

You should use a MERGE command on this table to match records on the unique column. Delta Lake does not enforce primary keys, so if you only append, duplicate IDs will appear. MERGE will give you the functionality you want.

https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-merge-into.html

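A minimal sketch of that MERGE, assuming a hypothetical target Delta table events keyed on a unique column id and a staging view events_updates holding the incoming rows (all three names are placeholders, not from the thread):

    -- Upsert the incoming batch: rows whose id already exists replace the
    -- old record, rows with a new id are inserted.
    MERGE INTO events AS t
    USING events_updates AS s
      ON t.id = s.id
    WHEN MATCHED THEN
      UPDATE SET *   -- overwrite the existing record with the updated values
    WHEN NOT MATCHED THEN
      INSERT *       -- add records whose id is not in the table yet

Because the match is on the unique column, re-running the statement with the same batch does not create duplicate rows, which a plain append would.
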
Reply by 652852, 4 Oct 2022, 06:10 GMT
https://community.databricks.com/t5/data-engineering/how-to-prevent-duplicate-entries-to-enter-to-delta-lake-of-azure/m-p/24948#M17365

According to the documentation, COPY INTO is supposed to be idempotent: on successive runs it should not reload files that have already been loaded. In my case, I created a table from existing data in S3 (many files). Then, hoping to load only the newly arrived files (batch ingestion), I tried COPY INTO, but it went ahead and naively reloaded everything from S3.

I also tried MERGE, but it looks like the source can't be Parquet files in S3; does it have to be another Delta table?

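A minimal sketch of the kind of COPY INTO call described above, with placeholder names (raw_events for the target Delta table and s3://my-bucket/landing/ for the source prefix, neither taken from the thread):

    -- Incremental batch ingestion of Parquet files from S3 into a Delta table.
    -- COPY INTO is documented to record the files it has already loaded into
    -- the target table and to skip them on subsequent runs.
    COPY INTO raw_events
      FROM 's3://my-bucket/landing/'
      FILEFORMAT = PARQUET;

Note that, as far as I know, the load history COPY INTO consults only covers files ingested by COPY INTO itself, so files that went into the table some other way (for example, when it was first created from the existing S3 data) are not counted as already loaded.
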

