What's the best way to implement long-term data versioning?
06-04-2021 11:47 AM
I'm a data scientist building versioned ML models. For compliance reasons, I need to be able to reproduce the training data for each model version.
I've seen that you can version datasets using Delta, but the default retention period is around 30 days. If I update my training data and model monthly, and want to track models (and their data) over years, what is the best way to version my data?
Is Delta an appropriate solution for this?
06-17-2021 10:36 PM
As you mentioned, Delta has a time travel feature, and by default Delta tables retain commit history for 30 days. Operations on a table's history are parallelized, but they become more expensive as the log size grows.
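For reference, time travel lets you query an earlier state of the table by version number or timestamp. A minimal sketch in Spark SQL (the table name `training_data` and the version/timestamp values are illustrative):

```sql
-- Read the table as it existed at a specific commit version
SELECT * FROM training_data VERSION AS OF 12;

-- Or as of a specific point in time
SELECT * FROM training_data TIMESTAMP AS OF '2021-05-01';
```

You can pair this with your model registry by recording the Delta version number alongside each model version.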
Now, in this case, since updates happen only once a month, it is worth considering increasing the retention interval by setting delta.logRetentionDuration, since you'd have at most 12 updates in a year.
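A sketch of extending retention, assuming a table named `training_data` and a two-year window. Note that two properties are involved: `delta.logRetentionDuration` controls how long the transaction log history is kept, while `delta.deletedFileRetentionDuration` controls how long VACUUM retains the underlying data files that old versions still reference — both need to cover your time-travel horizon:

```sql
-- Keep both the commit log and the referenced data files for ~2 years
ALTER TABLE training_data SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 730 days',
  'delta.deletedFileRetentionDuration' = 'interval 730 days'
);
```

Be aware that retaining deleted files for longer increases storage cost, since old data files are never vacuumed away during that window.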
If the update frequency is higher, consider cloning the Delta table instead.
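Cloning takes an explicit snapshot per model version, independent of the source table's retention settings. A minimal sketch (table names are illustrative; a deep clone copies the data files, while a shallow clone copies only metadata and still depends on the source's files):

```sql
-- Snapshot the training data used for the June 2021 model version
CREATE TABLE training_data_model_2021_06
DEEP CLONE training_data;
```

For compliance use cases, a deep clone is usually the safer choice, since the snapshot remains valid even if the source table is vacuumed or dropped.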