<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Overwriting a delta table using DLT in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75691#M35028</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;We are trying to ingest a batch of CSV files that we receive on a daily basis using DLT. We chose a streaming table for this purpose, but since a streaming table is append-only, records keep accumulating daily, which causes duplicate rows in downstream transformations. Is it possible to overwrite the data in the target table using DLT whenever we process a new file?&lt;/P&gt;&lt;P&gt;We cannot perform a MERGE either, as the data lacks a unique key.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;#DLT&lt;/P&gt;</description>
    <pubDate>Tue, 25 Jun 2024 10:28:28 GMT</pubDate>
    <dc:creator>Prajwal_082</dc:creator>
    <dc:date>2024-06-25T10:28:28Z</dc:date>
    <item>
      <title>Overwriting a delta table using DLT</title>
      <link>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75691#M35028</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;We are trying to ingest a batch of CSV files that we receive on a daily basis using DLT. We chose a streaming table for this purpose, but since a streaming table is append-only, records keep accumulating daily, which causes duplicate rows in downstream transformations. Is it possible to overwrite the data in the target table using DLT whenever we process a new file?&lt;/P&gt;&lt;P&gt;We cannot perform a MERGE either, as the data lacks a unique key.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;#DLT&lt;/P&gt;</description>
      <pubDate>Tue, 25 Jun 2024 10:28:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75691#M35028</guid>
      <dc:creator>Prajwal_082</dc:creator>
      <dc:date>2024-06-25T10:28:28Z</dc:date>
    </item>
    <item>
      <title>Re: Overwriting a delta table using DLT</title>
      <link>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75694#M35030</link>
      <description>&lt;P&gt;I'm not entirely certain I understand the use case, but my suggestion would be to delete "duplicates" downstream on the consumer side of the table that received the data from the CSV. Could you provide more details on the specific criteria used to identify a duplicate record in your scenario?&lt;/P&gt;</description>
      <pubDate>Tue, 25 Jun 2024 10:49:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75694#M35030</guid>
      <dc:creator>giuseppegrieco</dc:creator>
      <dc:date>2024-06-25T10:49:02Z</dc:date>
    </item>
    <item>
      <title>Re: Overwriting a delta table using DLT</title>
      <link>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75701#M35033</link>
      <description>&lt;P&gt;Deleting duplicates would not be the ideal approach here, because duplicates shouldn't be present in the first place. To identify duplicates, you can think of a simple GROUP BY on the candidate key columns (although there isn't a true unique key) with a HAVING count greater than one.&lt;/P&gt;&lt;P&gt;To understand the use case better, imagine a streaming table used to ingest data from a CSV file on a daily basis. On the first day, say 100 records are inserted. The next day we process a new file that contains new INSERTS/UPDATES/DELETES along with the old data that was inserted in the previous load (the first file). So we end up inserting a portion of the data twice, and the count is now 220 (assuming 20 new records).&lt;/P&gt;&lt;P&gt;Hope this is helpful.&lt;/P&gt;</description>
      <pubDate>Tue, 25 Jun 2024 12:51:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75701#M35033</guid>
      <dc:creator>Prajwal_082</dc:creator>
      <dc:date>2024-06-25T12:51:02Z</dc:date>
    </item>
    <item>
      <title>Re: Overwriting a delta table using DLT</title>
      <link>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75703#M35034</link>
      <description>&lt;P&gt;&lt;SPAN&gt;In your scenario, if the data loaded on day 2 also includes all the data from day 1, you can still apply a "remove duplicates" logic. For instance, you could compute a hashdiff by hashing all the columns and use this to exclude rows you've already seen. However, I believe you first need to load all the data, regardless of whether it contains duplicates. Once loaded, you can determine which rows are duplicates. Essentially, you need to examine the data before identifying duplicates.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 25 Jun 2024 13:16:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75703#M35034</guid>
      <dc:creator>giuseppegrieco</dc:creator>
      <dc:date>2024-06-25T13:16:34Z</dc:date>
    </item>
  </channel>
</rss>