
When would you use the Feature Store?

User16789201666
Databricks Employee

For example, would you use a feature store on your raw data, or what is the granularity of the features in the store?

1 REPLY

Joseph_B
Databricks Employee

I'll try to answer the broad question first, followed by the specific ones.

When would you use the Feature Store?

A Feature Store is primarily used to solve two challenges.

(1) Discoverability and governance of features

Challenge: In a large team or organization, data scientists may create many different features over time. The sheer number of features and collaborators can make it challenging to share features, avoid repeated work, avoid inconsistent feature definitions, and manage the feature data.

Solution: The Databricks Feature Store solves this by logging feature metadata to a centralized location, allowing users and admins to search for features, sort them, discover who owns them, see how recently they were updated, and so on. It also provides governance capabilities such as access control lists (ACLs) for security and APIs for managing features, and it links features to the notebooks or jobs that produced them and to the models that consume them.
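
To make that concrete, here is a minimal sketch of registering a feature table so its metadata (owner, description, schema, lineage) becomes searchable centrally. It assumes a Databricks runtime where the databricks.feature_store client is available; the table name, key, and the DataFrame `customer_features_df` are invented for illustration.

```python
# Minimal sketch, assuming a Databricks runtime with databricks.feature_store.
# The table name, primary key, and `customer_features_df` are hypothetical.
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Registering the table logs its metadata centrally, so the owner,
# description, schema, and lineage become searchable in the Feature Store UI.
fs.create_table(
    name="ml_features.customer_features",
    primary_keys=["customer_id"],
    df=customer_features_df,  # one row per customer_id (hypothetical)
    description="Aggregated customer activity features, refreshed daily.",
)
```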

(2) Online/offline data skew

Challenge: Features need to be consumed in several places in an ML workflow: training, batch or streaming scoring, and online serving. Without a feature store, featurization logic may need to be copied into each of these stages, which can lead to different definitions or versions of the same features across the workflow.

Solution: With the Databricks Feature Store, each of these stages pulls features from the same store: a snapshot of feature values for training, the latest values for batch or streaming scoring, or a snapshot for online serving. This avoids re-defining features and risking online/offline skew.
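
As a rough sketch of what that reuse looks like in code: one FeatureLookup definition feeds both training and batch scoring, so there is no second, hand-copied feature definition to drift. This assumes the databricks.feature_store client on a Databricks runtime; all table, column, and model names below are invented.

```python
# Minimal sketch, assuming a Databricks runtime with databricks.feature_store.
# Table/column/model names (ml_features.customer_features, churned, etc.)
# and the DataFrames `label_df` and `batch_df` are hypothetical.
import mlflow
from databricks.feature_store import FeatureStoreClient, FeatureLookup
from sklearn.linear_model import LogisticRegression

fs = FeatureStoreClient()

# One feature definition, reused by both training and scoring.
lookups = [
    FeatureLookup(
        table_name="ml_features.customer_features",
        feature_names=["num_events_30d", "avg_basket_value"],
        lookup_key="customer_id",
    )
]

# Training: join a snapshot of feature values to the labels by primary key.
training_set = fs.create_training_set(
    df=label_df,  # hypothetical DataFrame with customer_id + churned columns
    feature_lookups=lookups,
    label="churned",
)
pdf = training_set.load_df().toPandas()
model = LogisticRegression().fit(
    pdf[["num_events_30d", "avg_basket_value"]], pdf["churned"]
)

# Logging the model with its training set records the feature lineage,
# so batch scoring can look the same features up automatically.
fs.log_model(
    model,
    artifact_path="model",
    flavor=mlflow.sklearn,
    training_set=training_set,
    registered_model_name="churn_model",
)

# Batch scoring: the lookups are resolved from the model's own metadata,
# pulling the latest feature values from the same store.
predictions = fs.score_batch("models:/churn_model/1", batch_df)
```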

Would you use a feature store on your raw data, or what is the granularity of the features in the store?

Good question, but I think it's best answered with another question: What input does your ML application need? Some ML models need aggregated data as input features; that data might correspond to a "gold" table in your Delta ETL pipeline. Other ML models need very raw data such as free text; that data might correspond to a "bronze" or "silver" table in your Delta pipeline.
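
For instance, if the model needs per-customer aggregates, the feature table's granularity is one row per customer, rolled up from event-level raw data. A hedged sketch (the table and column names are invented, and `spark` is the usual Databricks session):

```python
# Minimal sketch of choosing a feature granularity: event-level "bronze" rows
# are aggregated to one row per customer before landing in the feature store.
# Table and column names are hypothetical.
from pyspark.sql import functions as F

raw_events = spark.read.table("bronze.events")  # hypothetical raw event table

customer_features_df = (
    raw_events
    .groupBy("customer_id")  # store granularity: one row per customer
    .agg(
        F.count("*").alias("num_events_30d"),
        F.avg("purchase_amount").alias("avg_basket_value"),
    )
)
```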

The Feature Store should provide features in the form the ML model expects. That said, the "ML model" can do some extra feature processing itself, for example in a scikit-learn pipeline or a custom MLflow model: the Feature Store might provide unnormalized numerical features, and the model might be responsible for normalizing them.
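
A minimal scikit-learn illustration of that split, where the store serves raw numeric values and the normalization lives inside the model:

```python
# Minimal sketch: the feature store serves unnormalized numeric features, and
# normalization sits inside the model pipeline, so training and serving
# always scale inputs identically.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline([
    ("scale", StandardScaler()),   # normalization travels with the model
    ("clf", LogisticRegression()),
])
# model.fit(X_train, y_train)  # X_train holds raw feature-store values
```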
