
When would you use the Feature Store?

User16789201666
Databricks Employee

For example, would you use a feature store on your raw data, or what is the granularity of the features in the store?

1 REPLY

Joseph_B
Databricks Employee

I'll try to answer the broad question first, followed by the specific ones.

When would you use the Feature Store?

A Feature Store is primarily used to solve two challenges.

(1) Discoverability and governance of features

Challenge: In a large team or organization, data scientists may create many different features over time. The sheer number of features and collaborators can make it challenging to share features, avoid repeated work, avoid inconsistent feature definitions, and manage the feature data.

Solution: The Databricks Feature Store solves this by logging feature metadata to a centralized location, allowing users and admins to search for features, sort them, discover who owns them, see how recently they were updated, and so on. It also provides governance capabilities such as access control lists (ACLs) for security and APIs for managing features, and it links features to the notebooks or jobs that produced them and to the models that consume them.
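
To make that concrete, here is a minimal sketch of registering a feature table so its metadata (owner, description, schema, lineage) becomes searchable centrally. It assumes a Databricks runtime where the databricks.feature_store client is available; the table name, key, and the DataFrame `customer_features_df` are invented for illustration.

```python
# Minimal sketch, assuming a Databricks runtime with databricks.feature_store.
# The table name, primary key, and `customer_features_df` are hypothetical.
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Registering the table logs its metadata centrally, so the owner,
# description, schema, and lineage become searchable in the Feature Store UI.
fs.create_table(
    name="ml_features.customer_features",
    primary_keys=["customer_id"],
    df=customer_features_df,  # one row per customer_id (hypothetical)
    description="Aggregated customer activity features, refreshed daily.",
)
```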

(2) Online/offline data skew

Challenge: Features need to be consumed in several places in an ML workflow: training, batch or streaming scoring, and online serving. Without a feature store, featurization logic may need to be copied into each of these stages, which can lead to different definitions or versions of the same features across the workflow.

Solution: With the Databricks Feature Store, each of these stages pulls features from the same store: a snapshot of feature values for training, the latest values for batch or streaming scoring, or a snapshot for online serving. This avoids re-defining features and risking online/offline skew.
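
As a rough sketch of what that reuse looks like in code: one FeatureLookup definition feeds both training and batch scoring, so there is no second, hand-copied feature definition to drift. This assumes the databricks.feature_store client on a Databricks runtime; all table, column, and model names below are invented.

```python
# Minimal sketch, assuming a Databricks runtime with databricks.feature_store.
# Table/column/model names (ml_features.customer_features, churned, etc.)
# and the DataFrames `label_df` and `batch_df` are hypothetical.
import mlflow
from databricks.feature_store import FeatureStoreClient, FeatureLookup
from sklearn.linear_model import LogisticRegression

fs = FeatureStoreClient()

# One feature definition, reused by both training and scoring.
lookups = [
    FeatureLookup(
        table_name="ml_features.customer_features",
        feature_names=["num_events_30d", "avg_basket_value"],
        lookup_key="customer_id",
    )
]

# Training: join a snapshot of feature values to the labels by primary key.
training_set = fs.create_training_set(
    df=label_df,  # hypothetical DataFrame with customer_id + churned columns
    feature_lookups=lookups,
    label="churned",
)
pdf = training_set.load_df().toPandas()
model = LogisticRegression().fit(
    pdf[["num_events_30d", "avg_basket_value"]], pdf["churned"]
)

# Logging the model with its training set records the feature lineage,
# so batch scoring can look the same features up automatically.
fs.log_model(
    model,
    artifact_path="model",
    flavor=mlflow.sklearn,
    training_set=training_set,
    registered_model_name="churn_model",
)

# Batch scoring: the lookups are resolved from the model's own metadata,
# pulling the latest feature values from the same store.
predictions = fs.score_batch("models:/churn_model/1", batch_df)
```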

Would you use a feature store on your raw data, or what is the granularity of the features in the store?

Good question, but I think it's best answered with another question: What input does your ML application need? Some ML models need aggregated data as input features; that data might correspond to a "gold" table in your Delta ETL pipeline. Other ML models need very raw data such as free text; that data might correspond to a "bronze" or "silver" table in your Delta pipeline.
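
For instance, if the model needs per-customer aggregates, the feature table's granularity is one row per customer, rolled up from event-level raw data. A hedged sketch (the table and column names are invented, and `spark` is the usual Databricks session):

```python
# Minimal sketch of choosing a feature granularity: event-level "bronze" rows
# are aggregated to one row per customer before landing in the feature store.
# Table and column names are hypothetical.
from pyspark.sql import functions as F

raw_events = spark.read.table("bronze.events")  # hypothetical raw event table

customer_features_df = (
    raw_events
    .groupBy("customer_id")  # store granularity: one row per customer
    .agg(
        F.count("*").alias("num_events_30d"),
        F.avg("purchase_amount").alias("avg_basket_value"),
    )
)
```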

The Feature Store should provide features in the form the ML model expects. That said, the "ML model" can do some extra feature processing itself, for example in a scikit-learn pipeline or a custom MLflow model: the Feature Store might provide unnormalized numerical features, and the model might be responsible for normalizing them.
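
A minimal scikit-learn illustration of that split, where the store serves raw numeric values and the normalization lives inside the model:

```python
# Minimal sketch: the feature store serves unnormalized numeric features, and
# normalization sits inside the model pipeline, so training and serving
# always scale inputs identically.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline([
    ("scale", StandardScaler()),   # normalization travels with the model
    ("clf", LogisticRegression()),
])
# model.fit(X_train, y_train)  # X_train holds raw feature-store values
```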
