08-07-2024 06:02 AM
Hi, I didn't find any "reasonable" way to clean old data from DLT pipeline tables. In DLT we use materialized views and streaming tables (SCD1, append-only). What is the best way to delete old data from these tables? Storage size grows linearly, of course, since new measurements arrive every day and old data is never deleted... yet. Let's say we want to delete all measurements whose measurement timestamp is older than one month.
08-23-2024 01:16 AM
This approach doesn't work with DLT, because a manual DELETE immediately breaks the pipeline: streaming tables expect an append-only source, so the next update fails when it detects the change. We would have to use skipChangeCommits = true, but that option has several limitations; in particular, it cannot be used with apply_changes(). Is there any other approach to cleaning up old data in DLT?
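For reference, the only read-side pattern that seems to work is setting skipChangeCommits on the streaming read itself, which only helps for plain append-only streaming tables. A minimal sketch, with hypothetical table names:

```python
# Minimal sketch (hypothetical names): an append-only DLT streaming table
# that tolerates DELETEs on its Delta source via skipChangeCommits.
# apply_changes() exposes no equivalent option, which is the problem for SCD1.
import dlt

@dlt.table(name="measurements_append_only")
def measurements_append_only():
    return (
        spark.readStream
        .option("skipChangeCommits", "true")  # ignore DELETE/UPDATE commits on the source
        .table("catalog.schema.measurements_bronze")  # hypothetical source table
    )
```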
08-23-2024 02:14 AM
Exactly, this is not a trivial problem. One possible solution is to take bronze out of the DLT pipeline and manage it yourself, for example with structured streaming from the source using skipChangeCommits and partitioning by year/month (whatever makes delete handling practical), and then define the silver layer in DLT as a full refresh/overwrite. Of course this doesn't fit every situation; it's just an idea, see the sketch below.
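A minimal sketch of that idea, with hypothetical table names, paths and columns: bronze is ingested outside DLT with structured streaming (skipChangeCommits on the Delta source, partitioned by date), and a separate retention job deletes old partitions, which is safe because the table is no longer DLT-managed.

```python
# Minimal sketch (hypothetical names): self-managed bronze outside DLT.
from pyspark.sql import functions as F

SOURCE = "catalog.schema.measurements_raw"          # hypothetical Delta source
BRONZE = "catalog.schema.measurements_bronze"       # hypothetical bronze table
CHECKPOINT = "/Volumes/main/ops/checkpoints/bronze" # hypothetical checkpoint path

# Structured streaming ingestion; skipChangeCommits keeps the stream alive
# even if rows are later deleted from the source table.
(
    spark.readStream
    .option("skipChangeCommits", "true")
    .table(SOURCE)
    .withColumn("event_date", F.to_date("measurement_ts"))  # partition column
    .writeStream
    .option("checkpointLocation", CHECKPOINT)
    .partitionBy("event_date")
    .toTable(BRONZE)
)

# Periodic retention job (run separately, e.g. a scheduled notebook).
# Deleting here is fine because bronze is not a DLT-managed streaming table.
spark.sql(f"""
    DELETE FROM {BRONZE}
    WHERE event_date < date_sub(current_date(), 30)
""")
```

The silver layer then stays in DLT as a materialized view (or a table you full-refresh) over this bronze table, so it simply recomputes without the deleted rows.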
08-23-2024 02:35 AM - edited 08-23-2024 02:37 AM
One effective approach is to partition your Delta tables on the data's timestamp (or a date derived from it). Old data then sits in separate partitions, and once those rows are deleted you can take advantage of Delta Lake's VACUUM command to remove the old files and reduce storage size.
VACUUM and OPTIMIZE can also be run on Delta Live Tables.
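A minimal sketch of the partitioning side, with hypothetical names: the DLT table is partitioned on a date derived from the measurement timestamp, so old data lands in separate partitions/files. Note that VACUUM only removes files that are no longer referenced by the table (for example after rows have been deleted or rewritten); it does not remove old rows by itself, and DLT already schedules OPTIMIZE/VACUUM as part of its automatic maintenance.

```python
# Minimal sketch (hypothetical names): a DLT table partitioned by event date.
import dlt
from pyspark.sql import functions as F

@dlt.table(
    name="measurements_bronze",
    partition_cols=["event_date"],  # partition on a date derived from the timestamp
)
def measurements_bronze():
    return (
        spark.readStream
        .table("catalog.schema.measurements_raw")  # hypothetical source
        .withColumn("event_date", F.to_date("measurement_ts"))
    )
```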
09-20-2024 08:16 AM
If you do a full refresh on that streaming table source, that should remove the old data. I am assuming you are feeding this into an SCD type 1 table, which overwrites the data.