feature store

pargit2 — Wed, 04 Jun 2025 07:58:38 GMT

I need to build a feature store for the data science team that returns one big DataFrame after one-hot encoding almost every dimension, then joining and grouping.

Should I create one feature store for the final output that contains all the relevant data, or create a feature store for each dimension and do the joins and aggregations in the feature store for the final output?

I'm worried because some of the columns are one-hot encoded doctor names, and those can change once in a while.

What is the best practice?

Thanks, this would really help me, and it seems you have a ton of experience.

Re: feature store

Louis_Frolio — Wed, 04 Jun 2025 11:31:06 GMT

Here are some things to consider:

The best practice for designing a feature store in your scenario depends on balancing scalability, maintainability, and the dynamic nature of some dimensions, such as doctor names. Here is an outline of what I would recommend:

Separate Feature Tables for Each Dimension:
- Create feature tables for each dimension rather than one monolithic feature table containing all data. This approach can improve update performance, especially for fast-changing features like doctor names, as only specific tables need updates rather than the entire dataset.
- Feature tables can be efficiently managed by defining them with primary keys that identify entities uniquely, such as doctor IDs or patient IDs.
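As a rough illustration of why per-dimension tables keyed by a primary key are cheaper to maintain, here is a minimal pandas sketch. pandas is only a stand-in for real feature tables, and the table names, columns, and the `upsert` helper are all hypothetical. The point is that a change to a doctor touches the doctor table only.

```python
import pandas as pd

# Hypothetical per-dimension feature tables, each with its own primary key.
doctor_features = pd.DataFrame({
    "doctor_id": [1, 2],
    "doctor_name": ["Dr. A", "Dr. B"],
})
patient_features = pd.DataFrame({
    "patient_id": [10, 11],
    "patient_age": [34, 61],
})


def upsert(table, updates, key):
    """Replace rows whose key appears in `updates` and append new keys."""
    combined = pd.concat([table, updates], ignore_index=True)
    # keep="last" lets rows from `updates` win over older rows with the same key.
    deduped = combined.drop_duplicates(subset=key, keep="last")
    return deduped.sort_values(key).reset_index(drop=True)


# One doctor is renamed and one is added; the patient table is not touched.
doctor_updates = pd.DataFrame({
    "doctor_id": [2, 3],
    "doctor_name": ["Dr. B. Smith", "Dr. C"],
})
doctor_features_v2 = upsert(doctor_features, doctor_updates, key="doctor_id")
print(doctor_features_v2)
```

With one monolithic table, the same rename would mean rewriting every row that carries the doctor's attributes.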
Dynamic Feature Handling and One-Hot Encoding:
- To handle dynamic columns (like doctor names), you can implement logic to redefine or backfill features when dimensions change. For example, introduce new columns or update existing ones as needed.
- One-hot encoding can be applied for categorical dimensions, but be cautious of cardinality. High-cardinality columns might increase dimensionality and reduce efficiency in models. Consider alternatives to one-hot encoding for such cases, such as embeddings.
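One way to keep one-hot columns stable when doctors come and go is to encode against a fixed vocabulary and route unseen names into a catch-all bucket. The sketch below uses pandas; the vocabulary, the `doctor_name` column, and the `one_hot_doctor` helper are assumptions made for illustration, not part of any feature store API.

```python
import pandas as pd

# Hypothetical fixed vocabulary; in practice it could live in a small lookup table.
KNOWN_DOCTORS = ["dr_adams", "dr_baker", "dr_chen"]
OTHER = "other"


def one_hot_doctor(df, known=KNOWN_DOCTORS):
    """One-hot encode doctor_name against a fixed vocabulary plus an 'other' bucket."""
    categories = list(known) + [OTHER]
    # Names outside the vocabulary are mapped to the catch-all bucket.
    names = df["doctor_name"].where(df["doctor_name"].isin(known), OTHER)
    # A Categorical with explicit categories keeps columns for unused categories too.
    names = pd.Series(pd.Categorical(names, categories=categories), index=df.index)
    dummies = pd.get_dummies(names, prefix="doctor", dtype=int)
    return pd.concat([df.drop(columns="doctor_name"), dummies], axis=1)


visits = pd.DataFrame({
    "visit_id": [1, 2, 3],
    "doctor_name": ["dr_adams", "dr_chen", "dr_zhou"],  # dr_zhou is not in the vocabulary
})
encoded = one_hot_doctor(visits)
print(encoded)
```

The output schema depends only on the vocabulary, so a new doctor does not add a column until the vocabulary is deliberately extended and the features are backfilled.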
Joins and Aggregates at the Feature Store Level:
- Perform joins and aggregations at the feature store level rather than delegating these tasks to models. This centralizes transformations and keeps training and inference aligned, which reduces the risk of offline/online data skew.
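To make the idea of centralized joins and aggregations concrete, here is a small pandas sketch in which a single function owns the aggregation logic and both training and inference call it. The column names and the `build_patient_features` function are hypothetical, and pandas stands in for whatever engine actually computes the features.

```python
import pandas as pd


def build_patient_features(visits):
    """Single definition of the per-patient aggregates, shared by training and inference."""
    return visits.groupby("patient_id", as_index=False).agg(
        visit_count=("visit_id", "count"),
        total_cost=("cost", "sum"),
        distinct_doctors=("doctor_id", "nunique"),
    )


# Hypothetical fact and dimension data.
visits = pd.DataFrame({
    "visit_id": [1, 2, 3, 4],
    "patient_id": [10, 10, 11, 10],
    "doctor_id": [1, 2, 1, 1],
    "cost": [100.0, 50.0, 75.0, 25.0],
})
patients = pd.DataFrame({"patient_id": [10, 11, 12], "age": [34, 61, 47]})

patient_features = build_patient_features(visits)

# Left join keeps patients with no visits; their aggregates default to zero.
final_df = patients.merge(patient_features, on="patient_id", how="left")
count_cols = ["visit_count", "distinct_doctors"]
final_df[count_cols] = final_df[count_cols].fillna(0).astype(int)
final_df["total_cost"] = final_df["total_cost"].fillna(0.0)
print(final_df)
```

Because the model code never re-implements the group-by, a change to the aggregation is made in one place.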
Time-Series Features and Versioning:
- For features that vary over time, create time-series feature tables and use techniques like "as-of" joins to retrieve values relevant to specific timestamps. Consider using Delta Lake's built-in versioning for auditability and historical retrievability.
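The "as-of" join mentioned above can be sketched with `pandas.merge_asof`, which for each row picks the latest feature value that was valid at or before the event timestamp. The tables and columns here are invented for illustration; a feature store would typically do the equivalent through a timestamp key on the feature table.

```python
import pandas as pd

# Hypothetical history of a doctor attribute that changes over time.
doctor_history = pd.DataFrame({
    "doctor_id": [1, 1, 2],
    "valid_from": pd.to_datetime(["2025-01-01", "2025-03-01", "2025-01-01"]),
    "department": ["cardiology", "oncology", "pediatrics"],
}).sort_values("valid_from")

visits = pd.DataFrame({
    "visit_id": [100, 101, 102],
    "doctor_id": [1, 1, 2],
    "visit_ts": pd.to_datetime(["2025-02-15", "2025-03-10", "2025-02-01"]),
}).sort_values("visit_ts")

# merge_asof needs both frames sorted on their time keys.
# direction="backward" takes the most recent row with valid_from <= visit_ts.
point_in_time = pd.merge_asof(
    visits,
    doctor_history,
    left_on="visit_ts",
    right_on="valid_from",
    by="doctor_id",
    direction="backward",
)
print(point_in_time[["visit_id", "doctor_id", "visit_ts", "department"]])
```

Visit 100 happens before doctor 1 moves departments, so it picks up the older value, while visit 101 picks up the newer one. This is what prevents future information from leaking into training rows.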
Governance and Collaboration:
- Enforce best practices such as keeping features discoverable, reusable, and accessible for all teams while ensuring automated lineage tracking for governance. Separate sensitive features (e.g., patient data) from less sensitive ones for better control.

Overall, splitting features into separate tables and handling dynamic dimensions programmatically ensures scalability and maintainability. Centralizing joins and aggregations avoids redundant design at the model level, while governance safeguards transformations against future changes. This approach aligns with Databricks Feature Store best practices while addressing your concerns about evolving dimensions like doctor names.

Cheers, Lou.

Re: feature store

pargit2 — Wed, 04 Jun 2025 18:05:40 GMT

Hi Lou,

Thanks for your reply.

Just to verify: you suggest building a feature store for each dimension and fact, and then one unified one that joins everything?

And you suggest avoiding one-hot encoding and using embeddings instead?

I hope I got you right.

One more thing: I work in a data engineering workspace, and our data science team has a different workspace.

Should I build only the tables (bronze, silver, gold) in the DE workspace and build the feature stores in their workspace?

I understand that I can't share a feature store using Delta Sharing. Is there another way to share it?

What is your recommendation?

Thanks,

Adi

topic Re: feature store in Get Started Discussions
