<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Do One-Hot-Encoding (OHE) before or after split data to train and test dataframe in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/do-one-hot-encoding-ohe-before-or-after-split-data-to-train-and/m-p/17889#M988</link>
    <description>&lt;P&gt;Hi @Nhat Hoang,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;While not Databricks-specific, here's a &lt;A href="https://datascience.stackexchange.com/questions/107714/encoding-before-vs-after-train-test-split" alt="https://datascience.stackexchange.com/questions/107714/encoding-before-vs-after-train-test-split" target="_blank"&gt;good answer&lt;/A&gt;:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;"If you perform the encoding before the split, it will lead to&amp;nbsp;&lt;B&gt;data leakage&lt;/B&gt;&amp;nbsp;(train-test contamination). In this sense, you will introduce new data (integers of Label Encoders) and use it for your models, thus it will affect the end prediction results (good validation scores but poor in deployment).&lt;/P&gt;&lt;P&gt;After the train and validation data category is already matched up, you can perform fit_transform on the train data, then only transform for the validation data, based on the encoding maps from the train data.&lt;/P&gt;&lt;P&gt;Almost all feature engineering, like standardization, normalisation, etc., should be done after the train-test split."&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Additionally, if you were to run an AutoML experiment and look at the underlying notebook, you should see that the data is split before encoding is applied.&lt;/P&gt;</description>
    <pubDate>Thu, 08 Dec 2022 14:27:47 GMT</pubDate>
    <dc:creator>LandanG</dc:creator>
    <dc:date>2022-12-08T14:27:47Z</dc:date>
    <item>
      <title>Do One-Hot-Encoding (OHE) before or after split data to train and test dataframe</title>
      <link>https://community.databricks.com/t5/machine-learning/do-one-hot-encoding-ohe-before-or-after-split-data-to-train-and/m-p/17888#M987</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I wonder whether I should do OHE before or after I split the data to build an ML model.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please give some advice.&lt;/P&gt;</description>
      <pubDate>Thu, 08 Dec 2022 08:52:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/do-one-hot-encoding-ohe-before-or-after-split-data-to-train-and/m-p/17888#M987</guid>
      <dc:creator>NhatHoang</dc:creator>
      <dc:date>2022-12-08T08:52:29Z</dc:date>
    </item>
    <item>
      <title>Re: Do One-Hot-Encoding (OHE) before or after split data to train and test dataframe</title>
      <link>https://community.databricks.com/t5/machine-learning/do-one-hot-encoding-ohe-before-or-after-split-data-to-train-and/m-p/17889#M988</link>
      <description>&lt;P&gt;Hi @Nhat Hoang,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;While not Databricks-specific, here's a &lt;A href="https://datascience.stackexchange.com/questions/107714/encoding-before-vs-after-train-test-split" alt="https://datascience.stackexchange.com/questions/107714/encoding-before-vs-after-train-test-split" target="_blank"&gt;good answer&lt;/A&gt;:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;"If you perform the encoding before the split, it will lead to&amp;nbsp;&lt;B&gt;data leakage&lt;/B&gt;&amp;nbsp;(train-test contamination). In this sense, you will introduce new data (integers of Label Encoders) and use it for your models, thus it will affect the end prediction results (good validation scores but poor in deployment).&lt;/P&gt;&lt;P&gt;After the train and validation data category is already matched up, you can perform fit_transform on the train data, then only transform for the validation data, based on the encoding maps from the train data.&lt;/P&gt;&lt;P&gt;Almost all feature engineering, like standardization, normalisation, etc., should be done after the train-test split."&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Additionally, if you were to run an AutoML experiment and look at the underlying notebook, you should see that the data is split before encoding is applied.&lt;/P&gt;</description>
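      <content:encoded><![CDATA[
<P>The "fit on train, then only transform the validation data" pattern from the quoted answer can be sketched roughly as follows. This is a minimal, standard-library-only illustration: <B>SimpleOneHotEncoder</B> is a hypothetical class written for this example (it only mimics the fit/transform naming convention and is not a scikit-learn or Databricks API), and the colour values are made-up data that is assumed to be split already.</P>

```python
from typing import Dict, List


class SimpleOneHotEncoder:
    """Hypothetical, illustration-only encoder (not a library class)."""

    def __init__(self) -> None:
        # Maps each category seen during fit() to its output column index.
        self.index_: Dict[str, int] = {}

    def fit(self, values: List[str]) -> "SimpleOneHotEncoder":
        # Learn the category-to-column map from the given values only.
        self.index_ = {cat: i for i, cat in enumerate(sorted(set(values)))}
        return self

    def transform(self, values: List[str]) -> List[List[int]]:
        # Encode with the learned map; unseen categories become all-zero rows.
        rows = []
        for value in values:
            row = [0] * len(self.index_)
            if value in self.index_:
                row[self.index_[value]] = 1
            rows.append(row)
        return rows

    def fit_transform(self, values: List[str]) -> List[List[int]]:
        return self.fit(values).transform(values)


# Made-up data, assumed to be split BEFORE any encoding happens.
train_colors = ["red", "green", "red", "blue"]
valid_colors = ["green", "purple"]  # "purple" never appears in the train split

encoder = SimpleOneHotEncoder()
train_encoded = encoder.fit_transform(train_colors)  # learn the map on train only
valid_encoded = encoder.transform(valid_colors)      # reuse the train map, no refit

print(encoder.index_)
print(train_encoded)
print(valid_encoded)
```

<P>Because the encoder is fitted on the train split alone, the validation split cannot change which columns exist. Mapping an unseen category to an all-zero row is one possible design choice for this sketch; raising an error instead would be an equally reasonable alternative.</P>
      ]]></content:encoded>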
      <pubDate>Thu, 08 Dec 2022 14:27:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/do-one-hot-encoding-ohe-before-or-after-split-data-to-train-and/m-p/17889#M988</guid>
      <dc:creator>LandanG</dc:creator>
      <dc:date>2022-12-08T14:27:47Z</dc:date>
    </item>
    <item>
      <title>Re: Do One-Hot-Encoding (OHE) before or after split data to train and test dataframe</title>
      <link>https://community.databricks.com/t5/machine-learning/do-one-hot-encoding-ohe-before-or-after-split-data-to-train-and/m-p/17890#M989</link>
      <description>&lt;P&gt;Hi @Landan George,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you very much. It is clear to me now.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;5-star support, Databricks team. :)&lt;/P&gt;</description>
      <pubDate>Fri, 09 Dec 2022 02:23:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/do-one-hot-encoding-ohe-before-or-after-split-data-to-train-and/m-p/17890#M989</guid>
      <dc:creator>NhatHoang</dc:creator>
      <dc:date>2022-12-09T02:23:02Z</dc:date>
    </item>
    <item>
      <title>Re: Do One-Hot-Encoding (OHE) before or after split data to train and test dataframe</title>
      <link>https://community.databricks.com/t5/machine-learning/do-one-hot-encoding-ohe-before-or-after-split-data-to-train-and/m-p/17891#M990</link>
      <description>&lt;P&gt;Thank you, @Nhat Hoang. I'm glad I could help.&lt;/P&gt;</description>
      <pubDate>Fri, 09 Dec 2022 14:01:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/do-one-hot-encoding-ohe-before-or-after-split-data-to-train-and/m-p/17891#M990</guid>
      <dc:creator>LandanG</dc:creator>
      <dc:date>2022-12-09T14:01:56Z</dc:date>
    </item>
  </channel>
</rss>

