<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Overwriting a delta table using DLT in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75691#M35028</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;We are trying to ingest a batch of CSV files that we receive on a daily basis using DLT. We chose a streaming table for this purpose, but since a streaming table is append-only, records keep accumulating daily, which causes duplicate rows in downstream transformations. Is it possible to overwrite the data in the target table using DLT whenever we process a new file?&lt;/P&gt;&lt;P&gt;We cannot perform a MERGE either, as the data lacks a unique key.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;#DLT&lt;/P&gt;</description>
    <pubDate>Tue, 25 Jun 2024 10:28:28 GMT</pubDate>
    <dc:creator>Prajwal_082</dc:creator>
    <dc:date>2024-06-25T10:28:28Z</dc:date>
    <item>
      <title>Overwriting a delta table using DLT</title>
      <link>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75691#M35028</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;We are trying to ingest a batch of CSV files that we receive on a daily basis using DLT. We chose a streaming table for this purpose, but since a streaming table is append-only, records keep accumulating daily, which causes duplicate rows in downstream transformations. Is it possible to overwrite the data in the target table using DLT whenever we process a new file?&lt;/P&gt;&lt;P&gt;We cannot perform a MERGE either, as the data lacks a unique key.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;#DLT&lt;/P&gt;</description>
      <pubDate>Tue, 25 Jun 2024 10:28:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75691#M35028</guid>
      <dc:creator>Prajwal_082</dc:creator>
      <dc:date>2024-06-25T10:28:28Z</dc:date>
    </item>
    <item>
      <title>Re: Overwriting a delta table using DLT</title>
      <link>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75694#M35030</link>
      <description>&lt;P&gt;I'm not entirely certain I understand the use case, but my suggestion would be to delete "duplicates" downstream on the consumer side of the table that received the data from the CSV. Could you provide more details on the specific criteria used to identify a duplicate record in your scenario?&lt;/P&gt;</description>
      <pubDate>Tue, 25 Jun 2024 10:49:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75694#M35030</guid>
      <dc:creator>giuseppegrieco</dc:creator>
      <dc:date>2024-06-25T10:49:02Z</dc:date>
    </item>
    <item>
      <title>Re: Overwriting a delta table using DLT</title>
      <link>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75701#M35033</link>
      <description>&lt;P&gt;Deleting duplicates would not be the ideal approach here, because duplicates shouldn't be present in the first place. To identify duplicates, you can think of a simple GROUP BY on the candidate key columns (although there isn't a true unique key) with a HAVING count greater than one.&lt;/P&gt;&lt;P&gt;To understand the use case better, imagine a streaming table used to ingest data from a CSV file on a daily basis. On the first day, say 100 records are inserted. The next day we process a new file that contains new INSERTS/UPDATES/DELETES along with the old data that was inserted in the previous load (the first file). So we end up inserting a portion of the data twice, and the count is now 220 (assuming 20 new records).&lt;/P&gt;&lt;P&gt;Hope this is helpful.&lt;/P&gt;</description>
      <pubDate>Tue, 25 Jun 2024 12:51:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75701#M35033</guid>
      <dc:creator>Prajwal_082</dc:creator>
      <dc:date>2024-06-25T12:51:02Z</dc:date>
    </item>
    <item>
      <title>Re: Overwriting a delta table using DLT</title>
      <link>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75703#M35034</link>
      <description>&lt;P&gt;&lt;SPAN&gt;In your scenario, if the data loaded on day 2 also includes all the data from day 1, you can still apply a "remove duplicates" logic. For instance, you could compute a hashdiff by hashing all the columns and use this to exclude rows you've already seen. However, I believe you first need to load all the data, regardless of whether it contains duplicates. Once loaded, you can determine which rows are duplicates. Essentially, you need to examine the data before identifying duplicates.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 25 Jun 2024 13:16:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/overwriting-a-delta-table-using-dlt/m-p/75703#M35034</guid>
      <dc:creator>giuseppegrieco</dc:creator>
      <dc:date>2024-06-25T13:16:34Z</dc:date>
    </item>
  </channel>
</rss>