
How does Databricks AutoML handle null imputation for categorical features by default?

spearitchmeta
Contributor

Hi everyone 

I’m using Databricks AutoML (classification workflow) on Databricks Runtime 10.4 LTS ML+, and I’d like to clarify how missing (null) values are handled for categorical (string) columns by default.

From the AutoML documentation, I see that:

“By default, AutoML selects an imputation method based on the column type and content.”

and that the imputers parameter allows manual overrides.
However, the docs don’t specify what the default strategy actually is for categorical features — e.g.:

  • Does AutoML create a new category such as "Unknown" or "Missing"?

  • Or does it impute with the mode/frequent category?

  • Or possibly drop rows / rely on the downstream model’s encoder to handle nulls?

I’d appreciate any clarification from Databricks engineers or users who have inspected the generated AutoML notebooks or pipelines to see what happens in practice.

I’m building a classification model with several categorical variables. Before running AutoML, I’d like to understand whether I should:

  • Pre-impute nulls manually (e.g., "Unknown" category), or

  • Let AutoML’s internal preprocessing handle them automatically.

Thanks in advance for any pointers or references! 

1 REPLY

Louis_Frolio
Databricks Employee

Hello @spearitchmeta, I looked into this internally and found some information that should shed light on your question.

 

Here’s how missing (null) values in categorical (string) columns are handled in Databricks AutoML on Databricks Runtime 10.4 LTS ML+, and what I recommend for your classification workflow.

What AutoML does by default (DBR 10.4 LTS ML+)

  • By default, AutoML selects an imputation method based on the column type and content. This applies to classification and regression workflows in DBR 10.4 LTS ML+, and you can override per-column strategies in the UI or API.
  • The API exposes explicit imputation strategies you can set per column: "mean", "median", "most_frequent", or "constant" with a fill_value. If you don’t specify a column, AutoML uses its type/content-based default.
  • In many AutoML-generated trial notebooks from DBR 10.x, numeric features are imputed with mean, and categorical features are primarily handled via one-hot encoding with handle_unknown="ignore" (no explicit categorical imputer shown in the standard pipeline). This means nulls aren’t dropped; they flow into the encoder stage.
  • In scikit-learn, OneHotEncoder controls unknown categories via handle_unknown (AutoML uses ignore commonly), and scikit-learn versions in this timeframe accept missing values and treat unknowns as all-zero encodings during transform. If you want a dedicated “Missing/Unknown” level, you should pre-fill or explicitly impute a sentinel category before encoding.
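To make the “unknowns become all zeros” behavior concrete, here is a rough pure-Python mimic of the one-hot “ignore unknowns” semantics described above. This is an illustrative sketch, not AutoML’s or scikit-learn’s actual code; the helper names are hypothetical.

```python
# Sketch of one-hot encoding with "ignore unknown" semantics,
# mimicking the effect of OneHotEncoder(handle_unknown="ignore").

def fit_categories(values):
    """Collect the distinct non-null categories seen during fit."""
    return sorted({v for v in values if v is not None})

def transform(value, categories):
    """One-hot encode one value; unknowns (incl. None) -> all zeros."""
    return [1 if value == c else 0 for c in categories]

cats = fit_categories(["red", "blue", "red", None])  # -> ["blue", "red"]
print(transform("red", cats))    # known category -> [0, 1]
print(transform(None, cats))     # null: all-zero row, not dropped -> [0, 0]
print(transform("green", cats))  # unseen category: also all zeros -> [0, 0]
```

Note that the null row is indistinguishable from any other unseen category after encoding, which is exactly why an explicit sentinel category is needed if “missing” should be a first-class level.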

Practical implications for categorical nulls

  • If your categorical columns contain nulls and the pipeline uses OneHotEncoder(handle_unknown="ignore"), they are not dropped; they pass through encoding. Depending on scikit-learn behavior for the version in DBR 10.4, nulls typically end up encoded either as all zeros (treated as unknown) or as their own level if present during fitting. If you need a guaranteed, explicit “Missing” category, it’s best to impute a sentinel string.
  • You can confirm the exact behavior on your dataset by opening the generated trial notebook (Best model → “View notebook” in the experiment UI; other trials’ notebooks are stored as MLflow artifacts). The notebook shows the preprocessing pipeline used (imputers, encoders, etc.).

How to control imputation explicitly (recommended)

If you want categorical nulls to become a fixed label like "Unknown", set the API’s imputers parameter per column to {"strategy": "constant", "fill_value": "Unknown"}. For example:
from databricks import automl  # Databricks AutoML Python API

imputers = {
    "cat_col_1": {"strategy": "constant", "fill_value": "Unknown"},
    "cat_col_2": {"strategy": "constant", "fill_value": "Unknown"},
    # numeric example:
    "num_col_1": "mean",
    # mode for a categorical:
    "cat_col_3": "most_frequent",
}

summary = automl.classify(
    dataset=train_df,
    target_col="label",
    imputers=imputers,
    timeout_minutes=60,
)
 

Should you pre-impute or let AutoML handle it?

Here’s a simple decision guide:
  • If you need a stable, interpretable “Missing” level for business semantics or downstream logic: pre-impute (or set imputers to constant) to "Unknown" or similar so your category space is explicit and reproducible.
  • If you prefer data-driven imputation for categoricals: use most_frequent (mode) via the imputers API for those columns. This is common when nulls are sparse and you want alignment with the dominant category.
  • If you don’t have a strong preference and want to keep the AutoML defaults: let AutoML handle it automatically, but plan to inspect the generated best-trial notebook to verify the exact pipeline decisions for your dataset.
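If you choose the pre-impute route, the transformation is just a null-to-sentinel fill on the categorical columns before handing the data to AutoML. In practice you would use Spark’s DataFrame.fillna (or pandas’ fillna) on your training data; the pure-Python sketch below only illustrates the idea, and the column names are hypothetical.

```python
# Sketch: replace nulls in categorical columns with a sentinel label
# before training, so "Unknown" is an explicit, reproducible level.
# Row/column structures here are illustrative, not Spark/pandas API.

CATEGORICAL_COLS = ["cat_col_1", "cat_col_2"]
SENTINEL = "Unknown"

def pre_impute(rows, cat_cols, sentinel=SENTINEL):
    """Return a copy of rows with None in cat_cols replaced by sentinel."""
    return [
        {k: (sentinel if k in cat_cols and v is None else v)
         for k, v in row.items()}
        for row in rows
    ]

rows = [
    {"cat_col_1": "a", "cat_col_2": None, "label": 1},
    {"cat_col_1": None, "cat_col_2": "b", "label": 0},
]
print(pre_impute(rows, CATEGORICAL_COLS))
```

Pre-imputing yourself (rather than relying on the imputers parameter) guarantees the "Unknown" level exists in your category space regardless of AutoML version or pipeline details.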
Hope this helps, Louis.