11-27-2023 01:54 AM
Can the apply_changes feature deal with deleted records in incoming source files?
By delete, I mean the record is physically removed (not a soft delete with a flag).
If not, how can deleting records from the Bronze streaming table be automated based on the source files?
11-27-2023 08:36 AM
Hi @ManojReddy, Certainly! Let's address your questions regarding change data capture (CDC) and automating record deletions in a Bronze streaming table based on source files.
Change Data Capture (CDC) with Delta Live Tables:
Automating Record Deletions in Bronze Streaming Table:
Remember that the Bronze layer is crucial for capturing raw data changes, and thoughtful design ensures data integrity and efficient processing.
11-29-2023 02:49 AM
APPLY CHANGES can do upserts, but I have doubts about deleting records by key.
Can APPLY CHANGES delete a record from the Bronze streaming table if its key is no longer present in the source Delta files?
03-15-2024 01:04 AM
Hi Manoj,
No. APPLY CHANGES does not delete data from Bronze if the key/data is simply absent from the source. It deletes based on the value of some incoming field, such as a status column (= "Delete") or similar.
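For reference, here is a minimal sketch of that pattern in the DLT Python API, assuming a hypothetical CDC source view cdc_events with a key column id, a sequencing column event_ts, and an operation field (none of these names come from the thread):

```python
import dlt
from pyspark.sql.functions import expr

dlt.create_streaming_table("bronze_target")

dlt.apply_changes(
    target="bronze_target",
    source="cdc_events",
    keys=["id"],
    sequence_by="event_ts",
    # A row is deleted from the target only when the incoming record says
    # so explicitly; a key merely disappearing from the source feed does
    # NOT trigger a delete.
    apply_as_deletes=expr("operation = 'DELETE'"),
)
```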
If no status can be provided by the source, then you will need to execute the deletes once again in each layer. Make sure that you set the skipChangeCommits flag to true so the downstream streams ignore any deletes and updates; streaming is append-only and hence does not expect any deletes or updates in the source (see the skipChangeCommits documentation).
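A minimal sketch of such a downstream read, assuming a hypothetical Silver table streaming from a Bronze table named bronze_table:

```python
import dlt

@dlt.table
def silver_table():
    # skipChangeCommits tells the stream to ignore commits that update or
    # delete rows in the source table, preserving the append-only contract.
    return (
        spark.readStream
        .option("skipChangeCommits", "true")
        .table("LIVE.bronze_table")
    )
```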
A common example is when you are clearing out old data from the source tables: you will need to do this for all layers. DLT will not do it automatically for you.
06-22-2024 05:42 PM
Hi Manoj,
Did you find a solution or a design change for this problem? We have about 200K files in an S3 bucket. When something changes in the upstream app, we receive a new feed, and the feed file name stays fixed. In DLT we should end up with only the new records from the replaced file, but DLT still holds the previously added records from that same file name. Since these are events from upstream, there is no status indicator on deleted records, so we cannot use apply_changes; a full refresh is also impractical, because reprocessing almost 200K files for a single replaced file takes too long.
06-23-2024 12:29 AM
Hi Vinodh,
Seems like DLT cannot handle it on its own.
I can think of a solution along these lines:
1) Maintain a copy of the 200K files in a separate location (the copied path). DLT should point to this copied path.
2) If an incoming file changes, run a process that inserts the deleted records with a delete status indicator and then copies the file over to the copied path. To find the deleted records, compare the new file with the existing file in the copied path. From there you can use DLT, because you now have a status indicator.
Basically, there should be a job that runs at a certain interval (e.g. every 10 minutes), tracks changed files based on the last update date, and compares each one with the existing file in the copied path to insert the deleted records with a delete status indicator (see the sketch below).
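A minimal sketch of the comparison step described above, assuming both feeds are CSV files sharing a key column id; the paths, column names, and status values are hypothetical placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Newly delivered feed (same fixed file name) and the previous copy.
new_df = spark.read.csv("s3://bucket/incoming/feed.csv", header=True)
old_df = spark.read.csv("s3://bucket/copied/feed.csv", header=True)

# Keys present in the old copy but missing from the new feed were deleted
# upstream; synthesize explicit delete markers for them.
deleted = (
    old_df.join(new_df, on="id", how="left_anti")
          .withColumn("status", lit("DELETE"))
)
kept = new_df.withColumn("status", lit("UPSERT"))

# Write the merged feed (new records plus delete markers) to a staging
# location first, then move it over the copied path that DLT monitors,
# so the old copy being read above is not overwritten mid-job.
(kept.unionByName(deleted)
     .write.mode("overwrite")
     .csv("s3://bucket/copied_staging/feed.csv", header=True))
```

With the status column in place, apply_changes with apply_as_deletes (as in the earlier sketch) can propagate the deletes instead of requiring a full refresh.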