Databricks Community

anonturtle · ‎01-30-2023

When I run AutoML from the UI, it classifies the feature "local_convenience_store" as both a numeric and a categorical column. This affects the result, because numeric columns are scaled while categorical columns are one-hot encoded.

For context, the feature's pandas dtype is int32, and its values range from 1 to 10.

So, how does AutoML decide which columns are numeric and which are categorical? Is it based on low cardinality?

Thanks for viewing 🙂

Anonymous · ‎04-10-2023

@hr then :

The approach taken by AutoML to classify features as numeric or categorical depends on the specific AutoML framework or library being used, as different implementations may use different methods or heuristics to make this determination.

In general, some common approaches include:

Examining the data type of the feature: This is a simple and straightforward approach, where a feature with a data type of int, float or similar is considered numeric, while a feature with a string or object data type is considered categorical. However, this approach can be limited as some features may be represented as integers but are actually categorical variables (such as zip codes).
Analyzing the number of unique values in the feature: A feature with few unique values (e.g. fewer than some threshold) is likely to be categorical, while a feature with many unique values is likely to be numeric. This approach works well for datasets where the distinction between categorical and numeric features is clear, but it can be challenging to choose an appropriate threshold.
Using domain knowledge: In some cases, the data scientist may have domain knowledge about the data and the meaning of the features that can be used to determine whether a feature is categorical or numeric.
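
To make the first two approaches concrete, here is a minimal sketch that combines a dtype check with a cardinality check in pandas. The function name `guess_feature_kind` and the threshold of 20 are hypothetical choices for illustration only. This is not how Databricks AutoML (or any specific library) is known to implement its detection.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical threshold, chosen for illustration; not taken from any AutoML implementation.
MAX_CATEGORICAL_CARDINALITY = 20


def guess_feature_kind(series: pd.Series,
                       max_cardinality: int = MAX_CATEGORICAL_CARDINALITY) -> str:
    """Return 'categorical' or 'numeric' using the dtype plus a cardinality check."""
    # Strings, objects and booleans are treated as categorical outright.
    if ptypes.is_bool_dtype(series) or not ptypes.is_numeric_dtype(series):
        return "categorical"
    # Integer columns with few distinct values are often codes or ratings.
    if ptypes.is_integer_dtype(series) and series.nunique(dropna=True) <= max_cardinality:
        return "categorical"
    return "numeric"


# Toy data; the first column mimics the int32 feature from the question.
df = pd.DataFrame(
    {
        "local_convenience_store": pd.Series([1, 3, 10, 7, 3, 1], dtype="int32"),
        "price": [199.5, 250.0, 180.25, 310.0, 275.75, 220.1],
        "city": ["A", "B", "A", "C", "B", "A"],
    }
)
kinds = {col: guess_feature_kind(df[col]) for col in df.columns}
print(kinds)
```

Under this heuristic, an int32 column with values from 1 to 10 falls on the categorical side, which shows how a low-cardinality integer feature can plausibly be read either way depending on where the threshold sits.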

It's worth noting that the classification of a feature as numeric or categorical can have a significant impact on the performance of machine learning models. In the case of AutoML, the specific approach used to classify features may depend on the particular algorithm being used, and how that algorithm is designed to handle different types of features.
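
As a rough illustration of why the choice matters, the sketch below applies both treatments mentioned in the question (scaling and one-hot encoding) to a toy int32 column with values between 1 and 10. It uses plain pandas rather than AutoML's actual preprocessing, and the sample values are made up.

```python
import pandas as pd

# Toy stand-in for the column from the question (int32, values in 1..10).
col = pd.Series([1, 5, 10, 5], name="local_convenience_store", dtype="int32")

# Numeric treatment: standardize to zero mean and unit variance (population std).
scaled = (col - col.mean()) / col.std(ddof=0)

# Categorical treatment: one indicator column per distinct value.
one_hot = pd.get_dummies(col, prefix=col.name)

print(scaled.round(3).tolist())
print(one_hot.columns.tolist())
```

The scaled version keeps a single column and preserves the ordering of the values, while the one-hot version produces one column per distinct value and discards the ordering, so a model sees quite different inputs in each case.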

How does automl classify which feature is numeric or categorical?
