cancel
Showing results for 
Search instead for 
Did you mean: 
Machine Learning
cancel
Showing results for 
Search instead for 
Did you mean: 

What's the best way to implement long term data versioning?

User16752240150
New Contributor II

I'm a data scientist creating versioned ML models. For compliance reasons, I need to be able to replicate the training data for each model version.

I've seen that you can version datasets by using delta, but the default retention period is around 30 days. If I update my training data and model monthly, and want to track models (and data) over years, what is the best way for me to version my data.

Is delta an appropriate solution for this?

1 REPLY 1

sajith_appukutt
Honored Contributor II

Delta, as you mentioned has a feature to do time travel and by default, delta tables retain the commit history for 30 days. Operations on history of the table are parallel but will become more expensive as the log size increases

Now, in this case - since updates happen only once a month , it is worth considering to increase the retention interval by setting delta.logRetentionDuration since you'd have utmost 12 updates in a year.

If the update frequency is more, consider cloning the delta table

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.