Data Engineering

How does automl classify which feature is numeric or categorical?

anonturtle
New Contributor

When I run AutoML from the UI, it classifies the feature "local_convenience_store" as both a numeric and a categorical column. This affects the result, because numeric columns are scaled while categorical columns are one-hot encoded.

For context, the feature's pandas dtype is int32, and its values range from 1 to 10.

So my question is: how does AutoML decide which columns are numeric and which are categorical? Is it based on low cardinality?

Thanks for viewing 🙂

1 REPLY

Anonymous
Not applicable

@hr then:

The approach taken by AutoML to classify features as numeric or categorical depends on the specific AutoML framework or library being used, as different implementations may use different methods or heuristics to make this determination.

In general, some common approaches include:

  1. Examining the data type of the feature: This is a simple and straightforward approach, where a feature with a data type of int, float or similar is considered numeric, while a feature with a string or object data type is considered categorical. However, this approach can be limited as some features may be represented as integers but are actually categorical variables (such as zip codes).
  2. Analyzing the number of unique values in the feature: A feature with few unique values (e.g. fewer than some threshold) is likely to be categorical, while a feature with many unique values is likely to be numeric. This works well when the distinction between categorical and numeric features is clear, but choosing an appropriate threshold can be difficult.
  3. Using domain knowledge: In some cases, the data scientist may have domain knowledge about the data and the meaning of the features that can be used to determine whether a feature is categorical or numeric.
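
To make the first two approaches concrete, here is a minimal sketch that combines a dtype check with a cardinality check in pandas. The function name `guess_feature_kind`, the `max_categories` threshold of 20, and the sample DataFrame are all assumptions made up for illustration; this is not how Databricks AutoML is known to implement its detection.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def guess_feature_kind(series: pd.Series, max_categories: int = 20) -> str:
    """Hypothetical heuristic for illustration only (not AutoML's actual logic)."""
    # Approach 1: non-numeric dtypes (strings, objects) are treated as categorical.
    if not is_numeric_dtype(series):
        return "categorical"
    # Approach 2: a numeric dtype with few distinct values could be either kind,
    # so flag it as ambiguous instead of guessing.
    if series.nunique(dropna=True) <= max_categories:
        return "ambiguous"
    return "numeric"


# Made-up sample data: an int32 column with values 1 to 10, as in the question.
df = pd.DataFrame({
    "local_convenience_store": pd.Series(list(range(1, 11)) * 3, dtype="int32"),
    "price": [i * 1.5 for i in range(30)],
    "city": ["a", "b", "c"] * 10,
})

kinds = {col: guess_feature_kind(df[col]) for col in df.columns}
print(kinds)
```

Under this sketch, a low-cardinality integer column such as "local_convenience_store" lands in the "ambiguous" bucket, which is where the third approach (domain knowledge) has to settle the question.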

Whether a feature is treated as numeric or categorical can significantly affect model performance, because it determines the preprocessing applied (for example, scaling versus one-hot encoding). In the case of AutoML, the specific approach used to classify features may depend on the particular algorithm being used, and how that algorithm is designed to handle different types of features.
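
As a rough illustration of why the choice matters, the sketch below applies both treatments to the same integer column using plain pandas. The sample values are made up, and manual standardization stands in for whatever scaler a given AutoML tool actually uses.

```python
import pandas as pd

# Made-up sample of an int32 feature with values between 1 and 10.
col = pd.Series([1, 4, 7, 10], name="local_convenience_store", dtype="int32")

# Treated as numeric: standardize to zero mean and unit variance.
# The model sees one column and the ordering 1 < 4 < 7 < 10 is preserved.
scaled = (col - col.mean()) / col.std(ddof=0)

# Treated as categorical: one-hot encode.
# The model sees one indicator column per distinct value and no ordering.
one_hot = pd.get_dummies(col, prefix=col.name)

print(scaled.tolist())
print(one_hot.columns.tolist())
```

The numeric treatment keeps a single ordered column, while the categorical treatment produces one column per distinct value, so a model can learn quite different relationships from the same raw data.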
