Do One-Hot-Encoding (OHE) before or after split data to train and test dataframe

NhatHoang
Valued Contributor II

Hi,

I wonder whether I should do OHE before or after I split the data when building an ML model.

Please give me some advice.

1 ACCEPTED SOLUTION

LandanG
Databricks Employee

Hi @Nhat Hoang,

While not Databricks-specific, here's a good answer:

"If you perform the encoding before the split, it will lead to data leakage (train-test contamination). In this sense, you will introduce new data (integers of Label Encoders) and use it for your models thus it will affect the end predictions results (good validation scores but poor in deployment).

After the train and validation data category is already matched up, you can perform fit_transform on the train data, then only transform for the validation data - based on the encoding maps from the train data.

Almost all feature engineering like standardization, Normalisation etc should be done after the train test split. "

Additionally, if you were to run an AutoML experiment and look at the underlying notebook, you should see that the data is split before encoding is applied.
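
To make this concrete, here is a minimal scikit-learn sketch (not from the thread; the toy color column and data are made up) that splits first, fits the encoder on the training data only, and reuses the fitted encoder on the test data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Toy data: one categorical feature and a binary label (made up for illustration)
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "green"],
    "label": [1, 0, 1, 0, 1, 0],
})

# 1) Split first, so the test rows never influence the encoding
X_train, X_test, y_train, y_test = train_test_split(
    df[["color"]], df["label"], test_size=0.33, random_state=42
)

# 2) Fit the encoder on the training data only
ohe = OneHotEncoder(handle_unknown="ignore")
X_train_ohe = ohe.fit_transform(X_train)

# 3) Reuse the already-fitted encoder on the test data
X_test_ohe = ohe.transform(X_test)

print(ohe.categories_)  # categories were learned from the train split only
```

Wrapping the encoder and the model in a scikit-learn Pipeline gives the same guarantee automatically, since the encoder is then fitted only on whatever data the pipeline itself is fitted on.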


3 REPLIES

NhatHoang
Valued Contributor II

Hi @Landan George,

Thank you very much. It is clear to me now.

5-star support, Databricks team. :)

LandanG
Databricks Employee

Thank you @Nhat Hoang, I'm glad I could help
