<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Feature store and medallion data location in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/feature-store-and-medallion-data-location/m-p/108484#M3937</link>
    <description></description>
    <pubDate>Mon, 03 Feb 2025 00:00:46 GMT</pubDate>
    <dc:creator>Isi</dc:creator>
    <dc:date>2025-02-03T00:00:46Z</dc:date>
    <item>
      <title>Feature store and medallion data location</title>
      <link>https://community.databricks.com/t5/machine-learning/feature-store-and-medallion-data-location/m-p/107672#M3931</link>
      <description>&lt;P&gt;Hello Folks,&lt;/P&gt;&lt;P&gt;If we have 3 environments (dev/preprod/prod) and would like to share medallion data among them, I guess Delta Sharing is a good way to go. Now, if we want to use a "Feature Store (FS)", I am a bit confused and am seeking some clarity. I want to build one FS instead of 3 separate ones, i.e. a "Centralized FS (CFS)". Where do we keep the actual data (raw to refined/aggregated, or rather bronze to gold), and how and where do we build this CFS?&lt;BR /&gt;I think it has to be outside of the 3 environments, right?&lt;BR /&gt;&lt;BR /&gt;So where will the medallion data live, where will the CFS live, and how do you propose that data scientists working in the 3 environments connect to this CFS?&lt;/P&gt;&lt;P&gt;Also, how and where do I create the feature set to be stored in the feature store? I believe this will be done in the lowest environment, i.e. dev? So the dev environment would need to get data from bronze to gold, do feature engineering, select the best features, and then save them to the CFS?&lt;/P&gt;</description>
      <pubDate>Wed, 29 Jan 2025 23:50:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/feature-store-and-medallion-data-location/m-p/107672#M3931</guid>
      <dc:creator>SDN</dc:creator>
      <dc:date>2025-01-29T23:50:14Z</dc:date>
    </item>
    <item>
      <title>Re: Feature store and medallion data location</title>
      <link>https://community.databricks.com/t5/machine-learning/feature-store-and-medallion-data-location/m-p/108484#M3937</link>
      <description>&lt;P&gt;Hey &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/145378"&gt;@SDN&lt;/a&gt;&amp;nbsp;!&lt;/P&gt;&lt;P class=""&gt;My recommendation is to work with &lt;SPAN class=""&gt;&lt;STRONG&gt;three separate workspaces&lt;/STRONG&gt;&lt;/SPAN&gt; (dev, preprod, prod). This is more infrastructure to manage, but it gives you &lt;SPAN class=""&gt;&lt;STRONG&gt;better stability and fewer issues in the long run&lt;/STRONG&gt;&lt;/SPAN&gt; because development and production are clearly separated.&lt;/P&gt;&lt;P class=""&gt;Each workspace should have its own &lt;SPAN class=""&gt;&lt;STRONG&gt;dedicated catalog (dev, preprod, prod)&lt;/STRONG&gt;&lt;/SPAN&gt;. I also recommend &lt;SPAN class=""&gt;&lt;STRONG&gt;granting dev and preprod read-only access to the prod catalog&lt;/STRONG&gt;&lt;/SPAN&gt;. That way developers can test against &lt;SPAN class=""&gt;&lt;STRONG&gt;either real production data or non-production data&lt;/STRONG&gt;&lt;/SPAN&gt; while production stays protected from unintended modifications.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Data Sharing Between Environments&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;To copy data between environments, you can use:&lt;/P&gt;&lt;P class=""&gt;• &lt;SPAN class=""&gt;&lt;STRONG&gt;Deep Clone&lt;/STRONG&gt;&lt;/SPAN&gt; (&lt;SPAN class=""&gt;DEEP CLONE&lt;/SPAN&gt;) → Copies the table's data files and metadata (schema, partitioning, table properties) to the target. Note that the clone starts its own version history rather than inheriting the source table's.&lt;/P&gt;&lt;P class=""&gt;• &lt;SPAN class=""&gt;&lt;STRONG&gt;CTAS (CREATE TABLE AS SELECT)&lt;/STRONG&gt;&lt;/SPAN&gt; → Creates a new table from a query result; only the selected data is written, so the source's version history is not retained.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Feature Store Management&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;From what I understand, you want to &lt;SPAN class=""&gt;&lt;STRONG&gt;create a centralized Feature Store (CFS)&lt;/STRONG&gt;&lt;/SPAN&gt;. A Feature Store is essentially &lt;SPAN class=""&gt;&lt;STRONG&gt;a set of tables&lt;/STRONG&gt;&lt;/SPAN&gt;, so you can keep it in a dedicated schema within the &lt;SPAN class=""&gt;&lt;STRONG&gt;production catalog&lt;/STRONG&gt;&lt;/SPAN&gt;.&lt;/P&gt;&lt;P class=""&gt;Since feature tables are derived from transformed data, they should be built from the &lt;STRONG&gt;Silver/&lt;/STRONG&gt;&lt;SPAN class=""&gt;&lt;STRONG&gt;Gold layer&lt;/STRONG&gt;&lt;/SPAN&gt; of your Medallion architecture. You also need to decide whether:&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;1. &lt;/SPAN&gt;&lt;STRONG&gt;Features should be extracted directly from Silver/Gold tables&lt;/STRONG&gt;&lt;SPAN class=""&gt;, or&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;2. &lt;/SPAN&gt;&lt;STRONG&gt;A separate pipeline should process and store feature data independently&lt;/STRONG&gt;&lt;SPAN class=""&gt;.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;If a &lt;/SPAN&gt;&lt;STRONG&gt;single Feature Store is shared across all environments&lt;/STRONG&gt;&lt;SPAN class=""&gt;, ensure that:&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;• Feature engineering is developed in &lt;SPAN class=""&gt;&lt;STRONG&gt;dev&lt;/STRONG&gt;&lt;/SPAN&gt; before features are promoted to prod.&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;• Feature Store updates follow a &lt;/SPAN&gt;&lt;STRONG&gt;controlled deployment process (CI/CD)&lt;/STRONG&gt;&lt;SPAN class=""&gt;.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;• Read and write permissions are well-defined to prevent &lt;/SPAN&gt;&lt;STRONG&gt;dev/preprod from accidentally overwriting production features&lt;/STRONG&gt;&lt;SPAN class=""&gt;.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;ETL Pipeline Considerations&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;The &lt;SPAN class=""&gt;&lt;STRONG&gt;ETL pipeline that processes data from raw to gold&lt;/STRONG&gt;&lt;/SPAN&gt; should run &lt;SPAN class=""&gt;&lt;STRONG&gt;in the production workspace&lt;/STRONG&gt;&lt;/SPAN&gt; to ensure a &lt;SPAN class=""&gt;&lt;STRONG&gt;single source of truth&lt;/STRONG&gt;&lt;/SPAN&gt;. This prevents inconsistencies and duplicated processing logic across environments.&lt;/P&gt;&lt;P class=""&gt;However, &lt;SPAN class=""&gt;&lt;STRONG&gt;development and testing should be done in dev/preprod&lt;/STRONG&gt;&lt;/SPAN&gt;, and only tested pipelines should be deployed to prod.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Security and Access Control&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;To enforce &lt;/SPAN&gt;&lt;STRONG&gt;controlled access to production data&lt;/STRONG&gt;&lt;SPAN class=""&gt;, consider using:&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;• &lt;SPAN class=""&gt;&lt;STRONG&gt;Unity Catalog&lt;/STRONG&gt;&lt;/SPAN&gt; for centralized permission management.&lt;/P&gt;&lt;P class=""&gt;• &lt;SPAN class=""&gt;&lt;STRONG&gt;External Locations&lt;/STRONG&gt;&lt;/SPAN&gt; to control access to the underlying cloud storage paths.&lt;BR /&gt;&lt;BR /&gt;Hope that helps &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 03 Feb 2025 00:00:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/feature-store-and-medallion-data-location/m-p/108484#M3937</guid>
      <dc:creator>Isi</dc:creator>
      <dc:date>2025-02-03T00:00:46Z</dc:date>
    </item>
    <item>
      <title>Re: Feature store and medallion data location</title>
      <link>https://community.databricks.com/t5/machine-learning/feature-store-and-medallion-data-location/m-p/109300#M3954</link>
      <description>&lt;P&gt;Thank you so much for the detailed answer!&lt;/P&gt;</description>
      <pubDate>Thu, 06 Feb 2025 20:34:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/feature-store-and-medallion-data-location/m-p/109300#M3954</guid>
      <dc:creator>SDN</dc:creator>
      <dc:date>2025-02-06T20:34:44Z</dc:date>
    </item>
  </channel>
</rss>

