Hi @Nhat Hoang,
While not Databricks-specific, here's a good answer:
"If you perform the encoding before the split, you will cause data leakage (train-test contamination): the encoder's integer mappings are learned from data that includes the validation set, which tends to inflate validation scores while the model performs poorly in deployment.
After the split, call fit_transform on the training data and then only transform on the validation data, so the validation set is encoded using the mapping learned from the training data.
Almost all feature engineering, such as standardization and normalization, should likewise be done after the train-test split."
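To make the fit-on-train, transform-on-validation pattern concrete, here is a minimal sketch using scikit-learn's `OrdinalEncoder` on a toy DataFrame (the column names and data are illustrative, not from your dataset; `handle_unknown="use_encoded_value"` is used so that categories unseen in training don't raise an error on the validation set):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Toy data; "color" is the categorical feature to encode.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "green"],
    "label": [1, 0, 1, 0, 1, 0],
})

# 1. Split FIRST, before any encoding is fitted.
train, valid = train_test_split(df, test_size=0.33, random_state=42)

# 2. Fit the encoder on the training rows only; unseen
#    validation categories map to -1 instead of erroring.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = enc.fit_transform(train[["color"]])  # learns the mapping from train only
valid_enc = enc.transform(valid[["color"]])      # reuses the train mapping, no refit
```

The key point is that `enc.transform` on the validation data never updates the learned category-to-integer map, so no information from the validation rows leaks into the features the model trains on.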
Additionally, if you run an AutoML experiment and inspect the generated notebook, you should see that the data is split before any encoding is applied.