When would you use the Feature Store?

User16789201666
Contributor II

For example, would you use a feature store on your raw data, or what is the granularity of the features in the store?

1 REPLY

Joseph_B
New Contributor III

I'll try to answer the broad question first, followed by the specific ones.

When would you use the Feature Store?

A Feature Store is primarily used to solve two challenges.

(1) Discoverability and governance of features

Challenge: In a large team or organization, data scientists may create many different features over time. The sheer number of features and collaborators can make it challenging to share features, avoid repeated work, avoid inconsistent feature definitions, and manage the feature data.

Solution: The Databricks Feature Store solves this by logging feature metadata to a centralized location, allowing users and admins to search for features, sort them, discover who owns them, see how recently they were updated, and so on. It also provides governance capabilities such as access control lists (ACLs) for security, APIs for managing features, and lineage that links features to the notebooks or jobs that produced them and to the models that consume them.
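To make the idea concrete, here is a minimal, purely illustrative sketch of the kind of metadata a centralized feature registry tracks and the lookups it enables. This is not the Databricks Feature Store API; all class and field names are made up for the example.

```python
# Conceptual sketch of a feature registry: a central index of feature
# metadata that supports search, ownership lookup, and lineage.
# All names here are illustrative, not the Databricks Feature Store API.
from dataclasses import dataclass, field


@dataclass
class FeatureTableEntry:
    name: str
    owner: str
    last_updated: str                                 # ISO date string
    produced_by: str                                  # notebook/job that writes it
    consumed_by: list = field(default_factory=list)   # models reading it


class FeatureRegistry:
    """Central index that makes features searchable and governable."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: FeatureTableEntry):
        self._entries[entry.name] = entry

    def search(self, keyword: str):
        # Discoverability: find existing features instead of re-creating them.
        return [e for e in self._entries.values() if keyword in e.name]

    def owner_of(self, name: str):
        # Governance: every feature table has a known owner.
        return self._entries[name].owner


registry = FeatureRegistry()
registry.register(FeatureTableEntry(
    name="customer_spend_30d",
    owner="alice@example.com",
    last_updated="2023-01-15",
    produced_by="notebooks/featurize_customers",
))

print([e.name for e in registry.search("spend")])   # ['customer_spend_30d']
print(registry.owner_of("customer_spend_30d"))      # alice@example.com
```

A real feature store adds ACLs, versioning, and automatic lineage capture on top of this basic registry idea.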

(2) Online/offline data skew

Challenge: Features need to be consumed in several places in an ML workflow: training, batch or streaming scoring, and online serving. Without a feature store, featurization logic may need to be copied to these multiple parts of the workflow, which can lead to different definitions or versions of features in these different parts of the workflow.

Solution: The Databricks Feature Store allows users to pull features from the same Feature Store and use them in each of these parts of the workflow: a snapshot of feature values for training, the latest set for batch or streaming scoring, or a snapshot for online serving. This avoids re-defining features and risking online-offline skew.
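The core of the skew fix is having a single definition of each feature that every path calls. A tiny sketch, with made-up feature names, of what that looks like in principle:

```python
# Sketch: one feature definition shared by the training and scoring paths,
# so both see identical logic. Feature names are illustrative.

def compute_features(raw_rows):
    """Single source of truth for featurization logic."""
    return [
        {"id": r["id"], "spend_per_visit": r["spend"] / max(r["visits"], 1)}
        for r in raw_rows
    ]


raw = [{"id": 1, "spend": 100.0, "visits": 4},
       {"id": 2, "spend": 30.0, "visits": 0}]

training_snapshot = compute_features(raw)   # frozen snapshot for training
scoring_features = compute_features(raw)    # latest values for batch scoring

# Because both paths call the same function, the feature definition
# cannot drift between training and serving.
assert training_snapshot == scoring_features
```

Without a feature store, `compute_features` tends to get copy-pasted into the training notebook, the batch job, and the serving code, and the three copies quietly diverge; centralizing the definition is what eliminates online/offline skew.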

Would you use a feature store on your raw data or what's the granularity of the features in the store?

Good question, but I think it's best answered with another question: What input does your ML application need? Some ML models need aggregated data as input features; that data might correspond to a "gold" table in your Delta ETL pipeline. Other ML models need very raw data such as free text; that data might correspond to a "bronze" or "silver" table in your Delta pipeline.
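As a small illustration of the granularity point, the same raw event-level ("bronze"-style) data can be rolled up into an aggregated ("gold"-style) feature when that is what the model needs. The data and feature below are invented for the example:

```python
# Sketch: raw event-level ("bronze") rows vs. an aggregated,
# user-level ("gold") feature derived from them. Data is illustrative.
from collections import defaultdict

events = [                      # raw, event-level data
    {"user": "u1", "amount": 10.0},
    {"user": "u1", "amount": 5.0},
    {"user": "u2", "amount": 7.5},
]

# Aggregated feature: total spend per user.
total_spend = defaultdict(float)
for e in events:
    total_spend[e["user"]] += e["amount"]

print(dict(total_spend))   # {'u1': 15.0, 'u2': 7.5}
```

A model scoring individual transactions might consume the raw rows directly, while a churn model would consume the per-user aggregate; the feature store can hold tables at either granularity.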

The Feature Store should provide features in the form the ML model expects. That said, the "ML model" could do some extra feature processing in, for example, a scikit-learn pipeline or a custom MLflow model. E.g., the Feature Store might provide unnormalized numerical features, and the model might be responsible for normalizing those features.
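To sketch that split of responsibilities, here is a toy model wrapper that normalizes the unnormalized features it receives before applying its weights. The class stands in for a scikit-learn Pipeline or custom MLflow model; all names and numbers are illustrative:

```python
# Sketch: the feature source provides unnormalized numeric features, and
# the model wrapper normalizes them itself (standing in for a
# scikit-learn Pipeline step). All names and values are illustrative.

class NormalizingModel:
    def __init__(self, weights):
        self.weights = weights
        self.mean = None
        self.std = None

    def fit_normalizer(self, rows):
        """Learn per-column mean/std from unnormalized training features."""
        n = len(rows)
        dims = len(rows[0])
        self.mean = [sum(r[i] for r in rows) / n for i in range(dims)]
        self.std = [
            (sum((r[i] - self.mean[i]) ** 2 for r in rows) / n) ** 0.5 or 1.0
            for i in range(dims)
        ]

    def predict(self, row):
        # Normalization happens inside the model, not in the feature table.
        z = [(x - m) / s for x, m, s in zip(row, self.mean, self.std)]
        return sum(w * x for w, x in zip(self.weights, z))


model = NormalizingModel(weights=[1.0, -0.5])
model.fit_normalizer([[1.0, 10.0], [3.0, 30.0]])  # raw-scale features
print(model.predict([3.0, 30.0]))                 # 0.5
```

Bundling the normalization with the model keeps the feature table reusable by other models that may want different (or no) scaling.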
