
Overwriting a delta table using DLT

Prajwal_082
New Contributor

Hello,

We are trying to ingest a bunch of CSV files that we receive daily using DLT. We chose a streaming table for this purpose, but since a streaming table is append-only, records keep adding up every day, which causes multiple rows in the downstream transformation. Is it possible to overwrite the data in the target table using DLT each time we process a new file?

We also cannot perform a merge, as the data lacks a unique key.

Thanks

#DLT

3 REPLIES

giuseppegrieco
New Contributor III

I'm not entirely certain I understand the use case, but my suggestion would be to delete "duplicates" downstream on the consumer side of the table that received the data from the CSV. Could you provide more details on the specific criteria used to identify a duplicate record in your scenario?

Prajwal_082
New Contributor

Deleting duplicates would not be ideal here, because duplicates shouldn't be present in the first place. To identify duplicates, think of a simple group by on the columns that together identify a row (although there isn't a unique key), having a count greater than one.

To understand the use case better, imagine a streaming table that ingests data from a CSV file daily. On the first day, let's say 100 records were inserted. The next day we process a new file, which has new inserts/updates/deletes along with the old data that was inserted in the previous load (the first file). So we end up inserting a portion of the data twice. Now the count is 220 (assuming we have 20 new records).
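
The scenario above can be sketched in plain Python (not DLT code) to make the arithmetic concrete. The row layout and the names `day1`, `day2`, and `table` are made up for illustration, and updates and deletes are left out for simplicity: the day-2 file is assumed to be the 100 old rows plus 20 new ones.

```python
from collections import Counter

# Hypothetical rows; the real data has no unique key, so a row is just its column values.
day1 = [(i, f"value-{i}") for i in range(100)]

# Assumption: the day-2 file repeats every day-1 row and adds 20 new rows.
day2 = day1 + [(i, f"value-{i}") for i in range(100, 120)]

# An append-only table simply accumulates both files.
table = day1 + day2

# Equivalent of "group by all columns having count > 1".
counts = Counter(table)
duplicates = {row: n for row, n in counts.items() if n > 1}

print(len(table))       # total rows after two loads
print(len(duplicates))  # distinct rows that were loaded more than once
```

With these assumptions the table holds 220 rows, of which the 100 day-1 rows each appear twice.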

Hope this is helpful.

giuseppegrieco
New Contributor III

In your scenario, if the data loaded on day 2 also includes all the data from day 1, you can still apply a "remove duplicates" logic. For instance, you could compute a hashdiff by hashing all the columns of each row and use it to exclude rows you've already seen. However, I believe you first need to load all the data, regardless of whether it contains duplicates, because you can only determine which rows are duplicates once the data has been loaded and examined.
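
As a rough illustration of the hashdiff idea, here is a minimal plain-Python sketch (not DLT code). The helper names `row_hash` and `new_rows` and the sample rows are hypothetical. It assumes that two rows with identical values in every column count as the same record, so genuinely repeated rows within one file would also be collapsed.

```python
import hashlib

def row_hash(row):
    """Hash all column values of a row into one hex digest (the 'hashdiff')."""
    # Join with a separator unlikely to occur in the data; mark None explicitly
    # so that a missing value does not hash the same as an empty string.
    joined = "\x1f".join("\x00" if v is None else str(v) for v in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def new_rows(incoming, seen_hashes):
    """Return only rows whose hash is not yet known, recording their hashes."""
    fresh = []
    for row in incoming:
        h = row_hash(row)
        if h not in seen_hashes:
            seen_hashes.add(h)
            fresh.append(row)
    return fresh

seen = set()
day1 = [("alice", 10), ("bob", 20)]
day2 = [("alice", 10), ("bob", 25), ("carol", 30)]  # one repeat, one change, one new

loaded_day1 = new_rows(day1, seen)
loaded_day2 = new_rows(day2, seen)
print(loaded_day2)
```

Here only the changed `bob` row and the new `carol` row pass through on day 2. Note that the old `bob` row is still in the target, and rows deleted at the source are not detected, so this handles repeats but not updates or deletes on its own. In Spark, a comparable hash column could be built with `sha2(concat_ws(...), 256)` from `pyspark.sql.functions`, keeping in mind that `concat_ws` skips nulls.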
