Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Overwriting a delta table using DLT

Prajwal_082
New Contributor II

Hello,

We are trying to ingest a bunch of CSV files that we receive daily using DLT. We chose a streaming table for this, but because a streaming table is append-only, records keep accumulating every day, which produces duplicate rows in downstream transformations. Is it possible to overwrite the data in the target table using DLT each time we process a new file?

We also cannot perform a merge, as the data lacks a unique key.

Thanks

#DLT

3 REPLIES

giuseppegrieco
New Contributor III

I'm not entirely certain I understand the use case, but my suggestion would be to delete "duplicates" downstream on the consumer side of the table that received the data from the CSV. Could you provide more details on the specific criteria used to identify a duplicate record in your scenario?

Prajwal_082
New Contributor II

Deleting duplicates would not be ideal here, because duplicates shouldn't be present in the first place. To identify duplicates, think of a simple GROUP BY on the columns that would act as a key (although there isn't a true unique key), keeping groups with a count greater than one.

To understand the use case better, imagine a streaming table used to ingest data from a CSV file daily. On the first day, let's say 100 records were inserted. The next day we process a new file, which contains new INSERTS/UPDATES/DELETES along with the old data that was inserted in the previous load (the first file). So we end up inserting a portion of the data twice. The count is now 220 (assuming we received 20 new records).
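The arithmetic above can be reproduced with a small plain-Python sketch. It does not use DLT or Spark; the row values and the two "load" styles are made up for illustration, and it assumes day 2 is a full snapshot that repeats day 1 unchanged plus 20 new rows.

```python
# Hypothetical daily files: day 2 repeats every day-1 row and adds 20 new ones.
day1 = [("row", i) for i in range(100)]
day2 = day1 + [("new_row", i) for i in range(20)]

# Append-only behaviour (how a streaming table accumulates data):
# every batch is added on top of what is already there.
appended = []
for batch in (day1, day2):
    appended.extend(batch)

# Overwrite behaviour (what the question asks for):
# the latest file replaces the previous contents.
overwritten = []
for batch in (day1, day2):
    overwritten = list(batch)

print(len(appended))     # 220
print(len(overwritten))  # 120
```

The append-only target holds 100 + 120 = 220 rows, while an overwrite would leave only the 120 rows of the latest file.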

Hope this is helpful.

giuseppegrieco
New Contributor III

In your scenario, if the data loaded on day 2 also includes all the data from day 1, you can still apply a "remove duplicates" logic. For instance, you could compute a hashdiff by hashing all the columns and use this to exclude rows you've already seen. However, I believe you first need to load all the data, regardless of whether it contains duplicates. Once loaded, you can determine which rows are duplicates. Essentially, you need to examine the data before identifying duplicates.
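A minimal sketch of the hashdiff idea, in plain Python rather than DLT so that it runs anywhere. The helper names `row_hash` and `new_rows`, the sample rows, and the in-memory `seen` set are all assumptions made for illustration; in a Spark pipeline the hash would more likely be a column computed with something like `sha2(concat_ws(...), 256)`, and the "seen" state would live in a table.

```python
import hashlib


def row_hash(row, sep="\x1f"):
    """Hash all columns of a row into one digest.

    None is rendered as an empty string, so None and "" hash the same;
    a real pipeline may want a distinct null marker.
    """
    joined = sep.join("" if v is None else str(v) for v in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def new_rows(batch, seen_hashes):
    """Return rows whose hash has not been seen yet, recording their hashes."""
    fresh = []
    for row in batch:
        h = row_hash(row)
        if h not in seen_hashes:
            seen_hashes.add(h)
            fresh.append(row)
    return fresh


seen = set()
day1 = [("alice", "NY", 10), ("bob", "LA", 20)]
# Day 2 resends day 1, adds one new row, and carries one changed row.
day2 = day1 + [("carol", "SF", 30), ("bob", "LA", 25)]

target = []
target += new_rows(day1, seen)
target += new_rows(day2, seen)
print(len(target))  # 4
```

Note the limits of this approach: the resent day-1 rows are skipped, but the changed `bob` row is added while the old one stays, and rows deleted at the source are never removed. Legitimate identical rows within one file would also be collapsed into one. Hashing therefore only filters exact re-sends; it is not a substitute for an overwrite or a keyed merge.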
