Data Engineering

How does automl classify which feature is numeric or categorical?

anonturtle
New Contributor

When I run AutoML from its UI, it classifies the feature "local_convenience_store" as both a numeric and a categorical column. This affects the result, because numeric columns are passed through a scaler while categorical columns are one-hot encoded.

For context, the feature's dtype in pandas is int32 and its values range from 1 to 10.

So I would like to ask: how does AutoML decide which columns are numeric and which are categorical? Is it based on low cardinality?

Thanks for viewing 🙂

1 REPLY

Anonymous
Not applicable

@hr then:

The approach taken by AutoML to classify features as numeric or categorical depends on the specific AutoML framework or library being used, as different implementations may use different methods or heuristics to make this determination.

In general, some common approaches include:

  1. Examining the data type of the feature: This is a simple and straightforward approach, where a feature with a data type of int, float or similar is considered numeric, while a feature with a string or object data type is considered categorical. However, this approach can be limited as some features may be represented as integers but are actually categorical variables (such as zip codes).
  2. Analyzing the number of unique values in the feature: A feature with few unique values (e.g. fewer than a chosen threshold) is likely to be categorical, while a feature with many unique values is likely to be numeric. This approach works well for datasets where the distinction between categorical and numeric features is clear, but choosing an appropriate threshold can be difficult, and an integer column with values from 1 to 10 sits right in the ambiguous range.
  3. Using domain knowledge: In some cases, the data scientist may have domain knowledge about the data and the meaning of the features that can be used to determine whether a feature is categorical or numeric.
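To make the first two approaches concrete, here is a minimal sketch in pandas that combines a dtype check with a cardinality threshold. It is an illustration only, not the logic any particular AutoML product uses; the function name `guess_feature_kinds`, the `max_categories` parameter, its default of 20, and the sample data are all assumptions made up for this example.

```python
import pandas as pd


def guess_feature_kinds(df: pd.DataFrame, max_categories: int = 20) -> dict:
    """Label each column 'numeric' or 'categorical' (hypothetical heuristic)."""
    kinds = {}
    for name in df.columns:
        col = df[name]
        is_numeric = pd.api.types.is_numeric_dtype(col)
        is_bool = pd.api.types.is_bool_dtype(col)
        if not is_numeric or is_bool:
            # Approach 1: strings, objects, categories and booleans
            kinds[name] = "categorical"
        elif pd.api.types.is_integer_dtype(col) and col.nunique() <= max_categories:
            # Approach 2: integer column with few distinct values
            kinds[name] = "categorical"
        else:
            kinds[name] = "numeric"
    return kinds


# Made-up sample data; the int32 column mirrors the one in the question.
df = pd.DataFrame({
    "local_convenience_store": pd.Series([1, 3, 10, 7, 3, 1], dtype="int32"),
    "price": [10.5, 20.1, 13.3, 8.9, 15.0, 11.2],
    "city": ["A", "B", "A", "C", "B", "A"],
})

kinds = guess_feature_kinds(df)
strict_kinds = guess_feature_kinds(df, max_categories=3)
print(kinds)
print(strict_kinds)
```

With the default threshold the int32 column is labeled categorical, and with `max_categories=3` the same column is labeled numeric, which shows how sensitive this kind of heuristic is to the threshold.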

It's worth noting that the classification of a feature as numeric or categorical can have a significant impact on the performance of machine learning models. In the case of AutoML, the specific approach used to classify features may depend on the particular algorithm being used, and how that algorithm is designed to handle different types of features.
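To see why the choice matters for a column like the one in the question, the sketch below applies both treatments to the same int32 values using plain pandas. The sample values are made up, and the standardization and one-hot steps are generic stand-ins for a scaler and an encoder, not the preprocessing any specific AutoML tool performs.

```python
import pandas as pd

# Made-up values in the 1..10 range, stored as int32 like the column in the question.
values = pd.Series([1, 3, 10, 7, 3, 1], dtype="int32", name="local_convenience_store")

# Numeric treatment: standardize to zero mean and unit variance (one output column).
scaled = (values - values.mean()) / values.std(ddof=0)

# Categorical treatment: one indicator column per distinct value.
one_hot = pd.get_dummies(values.astype("category"), prefix=values.name)

print(scaled.shape)   # one value per row
print(one_hot.shape)  # one column per distinct value
```

The numeric version keeps the ordering and spacing of the values in a single column, while the one-hot version discards ordering and produces one column per distinct value. If you know which interpretation is right, casting the column explicitly (for example with `astype("category")` or `astype("float64")`) states your intent in the data itself; whether a given AutoML tool honors that dtype is something to confirm in its documentation.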
