Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Feature store and medallion data location

SDN
New Contributor II

Hello Folks,

If we have three environments (dev/preprod/prod) and would like to have the medallion data shared among them, I assume Delta Sharing is a good way to go. Where I'm confused and seeking some clarity is the Feature Store (FS). I want to build one FS instead of three separate ones, i.e. a centralized FS (CFS). Where do we keep the actual data (raw to refined/aggregated, or rather bronze to gold), and how and where do we build this CFS?
I think it has to live outside of the three environments, right?

So where will the medallion data be, where will the CFS be, and how do you propose that data scientists working in the three environments connect to this CFS?

Also, how and where do I create the feature sets to be stored in the Feature Store? I believe this would be done in the lowest environment, meaning dev? So dev would need to read the bronze-to-gold data, do the feature engineering, select the best features, and then save them to the CFS?

 

1 ACCEPTED SOLUTION

Isi
New Contributor III

Hey @SDN !

My recommendation is to work with three separate workspaces (dev, preprod, prod). While this approach is more complex in terms of infrastructure, it provides better stability and fewer issues in the long run by ensuring clear separation between development and production environments.

Each workspace should have its own dedicated catalog (dev, preprod, prod). However, it is recommended to allow read-only access from dev and preprod to the prod catalog. This setup lets developers work with either real production data or non-production data for testing, while ensuring that no unintended modifications affect the production environment.
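
A minimal sketch of that read-only grant, assuming Unity Catalog is enabled and using a hypothetical account group `dev_engineers` (catalog-level grants cascade down to every schema and table inside):

```python
# Sketch, assuming Unity Catalog and a hypothetical group `dev_engineers`.
# Granting at the catalog level cascades to all schemas and tables inside it.
spark.sql("""
    GRANT USE CATALOG, USE SCHEMA, SELECT
    ON CATALOG prod
    TO `dev_engineers`
""")
```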

Data Sharing Between Environments

To move data between environments, you can use:

• DEEP CLONE → copies both the data and the table metadata (schema, partitioning, properties).

• CTAS (CREATE TABLE AS SELECT) → creates a new table from a query result, without carrying over table metadata.

(In both cases the new table starts its own Delta history, independent of the source.)
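
As a rough sketch of both options (the catalog, schema, and table names here are hypothetical):

```python
# DEEP CLONE: copies data plus table metadata; the clone keeps its own
# independent Delta history going forward.
spark.sql("""
    CREATE OR REPLACE TABLE dev.gold.customer_metrics
    DEEP CLONE prod.gold.customer_metrics
""")

# CTAS: creates a fresh table from a query result; no metadata is carried over.
spark.sql("""
    CREATE OR REPLACE TABLE dev.gold.customer_metrics_ctas AS
    SELECT * FROM prod.gold.customer_metrics
""")
```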

 

Feature Store Management

From what I understand, you want to create a centralized Feature Store (CFS). A Feature Store, though, is essentially a set of Delta tables, so rather than living outside your three environments it can sit in a dedicated schema within the production catalog.

Since Feature Stores are derived from transformed data, they should be built using the Silver/Gold layer of your Medallion architecture. It is also important to decide whether:

1. Features should be extracted directly from Silver/Gold tables, or

2. A separate pipeline should process and store feature data independently.
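
Whichever option you pick, a Databricks feature table is just a Delta table with a primary key registered through the client. Here is a minimal sketch of option 2, assuming the databricks-feature-engineering package and hypothetical schema, table, and column names:

```python
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

# Hypothetical example: derive features from a Gold table in the prod catalog.
features_df = spark.table("prod.gold.customer_metrics").selectExpr(
    "customer_id",        # primary key used for feature lookups
    "total_spend_90d",
    "orders_last_30d",
)

# Register the feature table in a dedicated schema, e.g. prod.features.
fe.create_table(
    name="prod.features.customer_features",
    primary_keys=["customer_id"],
    df=features_df,
    description="Customer-level features derived from the Gold layer",
)
```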

If a single Feature Store is shared across all environments, ensure that:

• Feature engineering is performed in dev before promoting features to prod.

• Feature Store updates follow a controlled deployment process (CI/CD).

• Read and write permissions are well-defined to prevent dev/preprod from accidentally overwriting production features.
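
For the controlled update path, the same client exposes a write API; a prod-deployed refresh job might look roughly like this (again with hypothetical names):

```python
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

# Hypothetical upstream step producing refreshed feature values.
refreshed_df = spark.table("prod.gold.customer_metrics").selectExpr(
    "customer_id", "total_spend_90d", "orders_last_30d"
)

# mode="merge" upserts rows by primary key instead of replacing the table,
# so a partial batch cannot wipe out existing production features.
fe.write_table(
    name="prod.features.customer_features",
    df=refreshed_df,
    mode="merge",
)
```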

ETL Pipeline Considerations

The ETL pipeline that processes data from raw to gold should run in the production workspace to ensure a single source of truth. This setup prevents inconsistencies and duplication of processing logic across environments.

However, development and testing should be done in dev/preprod, and only tested pipelines should be deployed to prod.
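
One common way to square those two points is to parameterize the pipeline on the catalog name, so the exact same code is promoted from dev to prod and only a job parameter changes. A sketch, where the parameter and table names are hypothetical:

```python
# Sketch: the same notebook runs in every workspace; only `catalog` changes.
dbutils.widgets.text("catalog", "dev")   # the prod job passes "prod" instead
catalog = dbutils.widgets.get("catalog")

# Hypothetical bronze-to-silver step.
bronze = spark.table(f"{catalog}.bronze.orders_raw")

silver = (
    bronze
    .filter("order_id IS NOT NULL")
    .dropDuplicates(["order_id"])
)

silver.write.mode("overwrite").saveAsTable(f"{catalog}.silver.orders")
```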

Security and Access Control

To enforce controlled access to production data, consider using:

• Unity Catalog for centralized permission management.

• External Locations for secure data sharing.
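
For example, an external location binds a cloud storage path to a storage credential so that file access is governed through Unity Catalog. A sketch, where the URL, credential, and group names are placeholders:

```python
# Sketch: register governed storage and grant read-only file access.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS shared_curated
    URL 'abfss://curated@examplestorageacct.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL example_storage_credential)
""")

spark.sql("""
    GRANT READ FILES ON EXTERNAL LOCATION shared_curated
    TO `data_scientists`
""")
```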

Hope that helps 🙂

2 REPLIES
SDN
New Contributor II

Thank you so much for the detailed answer!
