Databricks Community

Prajwal_082 · ‎06-25-2024

Hello,

We are trying to ingest a bunch of CSV files that we receive daily using DLT. We chose a streaming table for this purpose, but since a streaming table is append-only, records keep accumulating every day, which causes duplicate rows in downstream transformations. Is it possible to overwrite the data in the target table using DLT each time we process a new file?

We also cannot perform a merge, as the data lacks a unique key.

Thanks

#DLT

giuseppegrieco · ‎06-25-2024

I'm not entirely certain I understand the use case, but my suggestion would be to delete "duplicates" downstream on the consumer side of the table that received the data from the CSV. Could you provide more details on the specific criteria used to identify a duplicate record in your scenario?

Prajwal_082 · ‎06-25-2024

Deleting duplicates would not be ideal here, because duplicates shouldn't be present in the first place. To identify duplicates, you can think of a simple GROUP BY on the columns that would act as a key (although there isn't a true unique key), keeping groups with a count greater than one.

To understand the use case better, imagine a streaming table that ingests data from a CSV file daily. On the first day, let's say 100 records were inserted. The next day we process a new file, which contains new INSERTS/UPDATES/DELETES along with the old data that was inserted in the previous load (the first file). So we end up inserting a portion of the data twice. Now the count is 220 (assuming we have 20 new records).

Hope this is helpful.
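
The scenario above can be reproduced with a small plain-Python sketch. It is only an illustration of the arithmetic and of the "group by, count greater than one" check, not DLT code; the two-column rows and the names `day1`, `day2`, and `table` are made up for the example.

```python
from collections import Counter

# Hypothetical data: day 1 delivers 100 rows; day 2 re-sends them plus 20 new ones.
day1 = [(i, f"name_{i}") for i in range(100)]
day2 = day1 + [(i, f"name_{i}") for i in range(100, 120)]

# An append-only table simply accumulates both files.
table = day1 + day2

# Rough equivalent of GROUP BY <all columns> HAVING COUNT(*) > 1.
counts = Counter(table)
duplicates = {row: n for row, n in counts.items() if n > 1}

print(len(table))       # 220 rows in total
print(len(duplicates))  # 100 distinct rows that appear more than once
```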

giuseppegrieco · ‎06-25-2024

In your scenario, if the data loaded on day 2 also includes all the data from day 1, you can still apply a "remove duplicates" logic. For instance, you could compute a hashdiff by hashing all the columns and use this to exclude rows you've already seen. However, I believe you first need to load all the data, regardless of whether it contains duplicates. Once loaded, you can determine which rows are duplicates. Essentially, you need to examine the data before identifying duplicates.
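
A minimal sketch of the hashdiff idea, written in plain Python so it runs anywhere. The helper names `row_hash` and `append_new_rows`, the separator, and the sample rows are assumptions for illustration; in a Spark pipeline the same hash would more likely be built with `sha2(concat_ws(...), 256)` from `pyspark.sql.functions`.

```python
import hashlib


def row_hash(row):
    """Hash all columns of a row into one hex digest (the 'hashdiff')."""
    # Join with a separator so that ("ab", "c") and ("a", "bc") hash differently.
    # None is mapped to an empty string here, so None and "" hash alike.
    joined = "\x1f".join("" if v is None else str(v) for v in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def append_new_rows(table, seen_hashes, incoming):
    """Append only rows whose hash has not been seen; return how many were added."""
    added = 0
    for row in incoming:
        h = row_hash(row)
        if h not in seen_hashes:
            seen_hashes.add(h)
            table.append(row)
            added += 1
    return added


# Hypothetical loads: day 2 repeats all of day 1 and adds 20 new rows.
table, seen = [], set()
day1 = [(i, f"name_{i}") for i in range(100)]
day2 = day1 + [(i, f"name_{i}") for i in range(100, 120)]

added_day1 = append_new_rows(table, seen, day1)
added_day2 = append_new_rows(table, seen, day2)
print(added_day1, added_day2, len(table))
```

Note that this only filters exact repeats. A row whose values changed between files gets a new hash and is appended next to its old version, and rows deleted at the source are not detected, so updates and deletes would still need separate handling.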


Overwriting a delta table using DLT