Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

What's the best way to implement long-term data versioning?

User16752240150
New Contributor II

I'm a data scientist creating versioned ML models. For compliance reasons, I need to be able to replicate the training data for each model version.

I've seen that you can version datasets by using Delta, but the default retention period is around 30 days. If I update my training data and model monthly, and want to track models (and data) over years, what is the best way for me to version my data?

Is Delta an appropriate solution for this?

1 REPLY

sajith_appukutt
Honored Contributor II

Delta, as you mentioned, has a time travel feature, and by default Delta tables retain the commit history for 30 days. Operations on the table's history are parallelized, but they become more expensive as the log size increases.
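
For reference, reading back an older snapshot with time travel looks roughly like this. This is only a sketch: spark is the SparkSession provided in Databricks notebooks, and the table path, version number, and timestamp are placeholders.

```python
# Minimal sketch of Delta time travel reads; the path, version, and timestamp
# below are hypothetical.
df_v5 = (
    spark.read.format("delta")
    .option("versionAsOf", 5)               # table as of commit version 5
    .load("/mnt/data/training_data")
)

df_jan = (
    spark.read.format("delta")
    .option("timestampAsOf", "2021-01-31")  # or as of a timestamp
    .load("/mnt/data/training_data")
)
```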

Now, in this case, since updates happen only once a month, it is worth considering increasing the retention interval by setting delta.logRetentionDuration, since you'd have at most 12 updates in a year.
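
A sketch of setting that property on a table (the table name and intervals here are assumptions; pick intervals that match your compliance horizon). Note that time travelling back to old versions also depends on the underlying data files surviving VACUUM, which delta.deletedFileRetentionDuration governs.

```python
# Sketch, assuming a table named training_data; the intervals are placeholders.
spark.sql("""
  ALTER TABLE training_data SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 3650 days',
    'delta.deletedFileRetentionDuration' = 'interval 3650 days'
  )
""")
```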

If the update frequency is higher, consider cloning the Delta table.
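
For illustration, a monthly snapshot via a deep clone could look like this (the table names are assumptions). A deep clone copies the data files, so each snapshot stands on its own regardless of the source table's retention settings.

```python
# Illustrative monthly snapshot via DEEP CLONE; table names are hypothetical.
spark.sql("""
  CREATE TABLE IF NOT EXISTS training_data_2021_01
  DEEP CLONE training_data
""")
```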
