Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to deal with deleted records from the source files in DLT

ManojReddy
New Contributor II

Can the apply_changes feature deal with deleted records in incoming source files?
By delete I mean the record is physically removed from the source (not a soft delete with a flag).
If not, how can deletes be automated on the Bronze streaming table based on the source files?

4 REPLIES

APPLY CHANGES can do upserts, but I have doubts about deleting records by key.

Can APPLY CHANGES delete a record from the Bronze streaming table if the key is no longer present in the source Delta files?

 

Hi Manoj,

No. APPLY CHANGES does not delete data from Bronze when a key is simply absent from the source. It only deletes based on the value of an incoming field, such as a status column (e.g. status = "Delete").
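For reference, here is a minimal sketch of how APPLY CHANGES can consume such a status column via apply_as_deletes. The source view, landing path, and column names (cdc_source, id, event_ts, operation) are assumptions for illustration, not from this thread.

```python
import dlt
from pyspark.sql.functions import col, expr

@dlt.view
def cdc_source():
    # Assumed landing location and format; replace with your own feed.
    return (
        spark.readStream.format("cloudFiles")
             .option("cloudFiles.format", "json")
             .load("/mnt/landing/cdc/")
    )

dlt.create_streaming_table("silver_customers")

dlt.apply_changes(
    target="silver_customers",
    source="cdc_source",
    keys=["id"],                                    # business key to match on
    sequence_by=col("event_ts"),                    # ordering for out-of-order events
    apply_as_deletes=expr("operation = 'DELETE'"),  # rows flagged as deletes are removed
)
```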

If no status can be provided from the source, then you will need to execute the deletes yourself again in each layer. Make sure that you set the skipChangeCommits flag to true so the downstream streams ignore any deletes and updates. Streaming is append-only and hence does not expect any deletes or updates in the source. Link
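For example, a downstream stream over the Bronze table could be read like this (a minimal sketch; the table name bronze_events is hypothetical):

```python
# Ignore update/delete commits in the Delta source so the stream does not fail
# when manual deletes are executed in the Bronze layer.
df = (
    spark.readStream
         .option("skipChangeCommits", "true")
         .table("bronze_events")
)
```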

A common example is clearing out old data from the source tables: you will need to do this for all layers yourself, as DLT will not do it automatically for you.

2vinodhkumar
New Contributor II

Hi Manoj,

Did you find a solution or a design change for this problem? We have 200K files in an S3 bucket, and when there is a change in the upstream app we get a new feed with a fixed feed name. In DLT we should end up with only the new records from the replaced file, but we still have the previously added records from the same file name. As we don't have any status indicator on deleted records (these are events from upstream), we are unable to use apply_changes, and we can't do a full refresh either, because with almost 200K files a full refresh triggered by one file replacement takes too long.

Hi Vinodh,

It seems DLT cannot handle this on its own.
I can think of a solution along these lines:

1) Maintain a copy of the 200K files in a separate location (the copied path). DLT should point to this copied path.

2) If there is any change in an incoming file, run a process that inserts the deleted records with a status indicator of "delete" and copies the file over to the copied path. To identify the deleted records, you need to compare the incoming file with the file already in the copied path. Then you can use DLT, because you now have a status indicator.

Basically there should be a job that runs at a certain interval (e.g. every 10 minutes), tracks changed files based on the last update date, and then compares each one with the existing file in the copied path to insert the deleted records with the status indicator. A rough sketch of that comparison step is shown below.
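A minimal sketch of the comparison step, assuming JSON files keyed by an id column; the paths and column names are hypothetical and not from this thread:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

incoming_path = "s3://bucket/incoming/feed.json"   # file replaced by the upstream app
copied_path   = "s3://bucket/copied/feed.json"     # path DLT actually reads from
staging_path  = "s3://bucket/staging/feed.json"    # temporary output location

new_df = spark.read.json(incoming_path)
old_df = spark.read.json(copied_path)

# Keys present in the old copy but missing from the new feed were deleted upstream.
deleted_df = (
    old_df.join(new_df, on="id", how="left_anti")
          .withColumn("status", lit("DELETE"))
)

# Keep the new records with a normal status and append the delete markers,
# so APPLY CHANGES can consume them via apply_as_deletes.
merged_df = new_df.withColumn("status", lit("UPSERT")).unionByName(deleted_df)

merged_df.write.mode("overwrite").json(staging_path)
# Finally move the staged output over copied_path (e.g. with dbutils.fs.mv)
# so DLT picks up the file with the delete flags on its next update.
```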

 
