Databricks Community

ManojReddy · ‎11-27-2023

Can apply_changes feature deal with deleted records in incoming source Files?
By delete I mean record is being removed (Not a soft delete with Flag).
If not, how to automate with deleting records from Bronze Streaming table based on source Files.

ManojReddy · ‎11-29-2023

APPLY CHANGES can do upserts but I have doubts regarding deleting records by key.

Does APPLY CHANGES can deletes record from bronze streaming if key is not present in source delta files.

Edthehead · ‎03-15-2024

Hi Manoj,

No. APPLY CHANGES does not delete the data from bonze if key/data is not present in source. It will delete it based on the value of some incoming field. Something like a status (="Delete") or such.

If no status can be provided from source then you will need to execute the deletes once again in each layer. Make sure that you set the skipChangeCommits flag to true so the streams ignore any deletes and updates. Streaming is append only and hence does not expect any deletes or updates in source. Link

A common example is if you are clearing out old data from the source tables, You will need to do this for all layers. DLT will not do it automatically for you.

2vinodhkumar · ‎06-22-2024

Hi Manoj,

Did you get the solution or design change for this problem. We have 200K files on to S3 bucket and when there is change in upstream app we get new feed, feed name is fixed. On DLT we should have only new records from replaced file but in dlt we have previously added records from same file name. As we don't have any status indicator on deleted records as these are events from upstream,we are unable to do apply_changes also we can't do full refresh as we have almost 200K files on one file replacement this full refresh takes time.

ManojReddy · ‎06-23-2024

Hi Vinodh,

Seems like DLT cannot handle it on its own.
I think of a solution which goes like this.

1) Maintain the copy of 200k files in a location (copied path). DLT should point to this copied path.

2)If there is any change in the incoming file. Run a process to insert the deleted records with status indicator as delete and copy this file over to copied path. For inserting the deleted records you need to compare with the file in the copied path. These you can use DLT because you have status indicator.

Basically there should be a job with runs in certain interval(ex: 10mins) which tracks the changed file based on last update date and then compare with existing file in copied path to insert the deleted records with status indicator.

Databricks Community

How to deal with delete records from the source Files in DLT .

Connect with Databricks Users in Your Area

Databricks Learning Festival (Virtual): 15 January - 31 January 2025

Milestone: DatabricksTV Reaches 100 Videos!

Announcing the new Meta Llama 3.3 model on Databricks

Databricks Community Champion - December 2024 - Sujesh Menon

Dotmatics and Databricks Partner to Advance Scientific Intelligence in Life Sciences