Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

What's the best way to implement long term data versioning?

User16752240150
New Contributor II

I'm a data scientist creating versioned ML models. For compliance reasons, I need to be able to replicate the training data for each model version.

I've seen that you can version datasets using Delta, but the default retention period is around 30 days. If I update my training data and model monthly and want to track models (and data) over years, what is the best way to version my data?

Is Delta an appropriate solution for this?

1 REPLY

sajith_appukutt
Honored Contributor II

As you mentioned, Delta supports time travel, and by default Delta tables retain their commit history for 30 days. Operations on the table's history run in parallel but become more expensive as the log grows.

In this case, since updates happen only once a month, it is worth increasing the retention interval by setting delta.logRetentionDuration: you would have at most 12 updates a year.
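As a rough sketch of the retention change, the helper below only builds the SQL text, so it runs without a cluster. The table name `ml.training_data`, the three-year window, and the helper names are assumptions for illustration. It also sets delta.deletedFileRetentionDuration, which the reply above does not mention: time travel needs the old data files as well as the log, and VACUUM removes files older than that property (7 days by default).

```python
def retention_statement(table: str, days: int) -> str:
    """Build an ALTER TABLE statement that extends Delta retention to `days`.

    Sets both the commit-log retention and the deleted-file retention,
    since time travel needs the log entries and the underlying data files.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    interval = f"interval {days} days"
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'delta.logRetentionDuration' = '{interval}', "
        f"'delta.deletedFileRetentionDuration' = '{interval}')"
    )


def snapshot_query(table: str, version: int) -> str:
    """Build a time-travel query that reads one specific table version."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"SELECT * FROM {table} VERSION AS OF {version}"


if __name__ == "__main__":
    # Hypothetical table name and retention window, for illustration only.
    print(retention_statement("ml.training_data", 3 * 365))
    print(snapshot_query("ml.training_data", 12))
```

On a cluster, each string would presumably be passed to `spark.sql(...)`. Recording the table version next to each model version (for example as a run tag) is what makes the training data reproducible later.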

If updates are more frequent, consider cloning the Delta table instead.
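One way to read the cloning suggestion is to take a deep clone of the training table for each model version, so the snapshot no longer depends on the source table's retention settings. The sketch below assumes Databricks' `DEEP CLONE` SQL syntax; the naming scheme and helper names are hypothetical, and the code only builds the statement text.

```python
from typing import Optional


def snapshot_name(base: str, model_version: str) -> str:
    """Derive a snapshot table name from a model version (hypothetical scheme)."""
    safe = model_version.replace(".", "_").replace("-", "_")
    return f"{base}_snapshot_{safe}"


def clone_statement(source: str, target: str, version: Optional[int] = None) -> str:
    """Build a DEEP CLONE statement, optionally pinned to a source table version."""
    stmt = f"CREATE TABLE IF NOT EXISTS {target} DEEP CLONE {source}"
    if version is not None:
        stmt += f" VERSION AS OF {version}"
    return stmt


if __name__ == "__main__":
    # Hypothetical table and model version, for illustration only.
    source = "ml.training_data"
    target = snapshot_name(source, "1.4.0")
    print(clone_statement(source, target, version=7))
```

A deep clone copies the data files, so each snapshot costs storage in proportion to the table size. That trade-off is probably acceptable for monthly snapshots but is worth checking for large tables.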
