Hello @spearitchmeta, I looked into this internally and found some information that should shed light on your question.
Here’s how missing (null) values in categorical (string) columns are handled in Databricks AutoML on Databricks Runtime 10.4 LTS ML+, and what I recommend for your classification workflow.
**What AutoML does by default (DBR 10.4 LTS ML+)**
- By default, AutoML selects an imputation method based on the column type and content. This applies to classification and regression workflows in DBR 10.4 LTS ML+, and you can override per-column strategies in the UI or API.
- The API exposes explicit imputation strategies you can set per column: "mean", "median", "most_frequent", or "constant" with a fill_value. If you don't specify a column, AutoML uses its type/content-based default.
- In many AutoML-generated trial notebooks from DBR 10.x, numeric features are imputed with the mean, and categorical features are handled primarily via one-hot encoding with handle_unknown="ignore" (no explicit categorical imputer appears in the standard pipeline). This means nulls aren't dropped; they flow into the encoder stage.
- In scikit-learn, OneHotEncoder controls unknown categories via handle_unknown (AutoML commonly uses "ignore"). The scikit-learn versions in this timeframe accept missing values during fit and encode unknown categories as all zeros during transform. If you want a dedicated "Missing"/"Unknown" level, pre-fill or explicitly impute a sentinel category before encoding (see the sketch below).
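To make that concrete, here's a minimal sketch with plain scikit-learn (outside AutoML). It assumes a scikit-learn version whose OneHotEncoder supports missing values (0.24+, which is what this DBR line ships); the "color" column is made up:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "blue", np.nan]})
test = pd.DataFrame({"color": ["red", np.nan, "green"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# NaN was seen during fit, so it becomes its own category/column.
print(enc.categories_)  # e.g. [array(['blue', 'red', nan], dtype=object)]

# "green" was never seen during fit -> all-zero row (handle_unknown="ignore");
# NaN maps to its own column because it was present at fit time.
print(enc.transform(test).toarray())
```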
**Practical implications for categorical nulls**
- If your categorical columns contain nulls and the pipeline uses OneHotEncoder(handle_unknown="ignore"), they are not dropped; they pass through encoding. Depending on the scikit-learn version in DBR 10.4, nulls end up encoded either as all zeros (treated as unknown) or as their own level if they were present during fitting. If you need a guaranteed, explicit "Missing" category, impute a sentinel string (see the sketch after this list).
- You can confirm the exact behavior on your dataset by opening the generated trial notebook (Best model → “View notebook” in the experiment UI; other trials’ notebooks are stored as MLflow artifacts). The notebook shows the preprocessing pipeline used (imputers, encoders, etc.).
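If you do want the guaranteed sentinel, here's a minimal sketch of that pattern with plain scikit-learn; the column names are hypothetical, and AutoML's generated pipeline may structure this differently:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_cols = ["cat_col_1", "cat_col_2"]  # hypothetical names

# Fill nulls with a fixed string *before* encoding, so "Missing" is always
# a real, fitted category rather than an all-zero unknown row.
cat_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    transformers=[("cat", cat_pipeline, categorical_cols)],
    remainder="passthrough",
)
```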
**How to control imputation explicitly (recommended)**
If you want categorical nulls to become a fixed label like "Unknown", set the API imputers parameter per column to {"strategy": "constant", "fill_value": "Unknown"}.
For example:
```python
import databricks.automl as automl

# Per-column strategies; columns not listed keep AutoML's defaults.
imputers = {
    "cat_col_1": {"strategy": "constant", "fill_value": "Unknown"},
    "cat_col_2": {"strategy": "constant", "fill_value": "Unknown"},
    # numeric example:
    "num_col_1": "mean",
    # mode for a categorical:
    "cat_col_3": "most_frequent",
}

summary = automl.classify(
    dataset=train_df,
    target_col="label",
    imputers=imputers,
    timeout_minutes=60,
)
```
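After the run completes, you can go straight from the returned summary to the generated notebook to verify what the pipeline actually did. The attribute names below follow the documented AutoMLSummary API, but double-check them against your runtime's docs:

```python
# Workspace path and direct URL of the best trial's generated notebook.
print(summary.best_trial.notebook_path)
print(summary.best_trial.notebook_url)
```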
**Should you pre-impute or let AutoML handle it?**
Here’s a simple decision guide:
- If you need a stable, interpretable "Missing" level for business semantics or downstream logic: pre-impute (or set imputers to constant) to "Unknown" or similar, so your category space is explicit and reproducible (see the PySpark sketch after this list).
- If you prefer data-driven imputation for categoricals: use most_frequent (mode) via the imputers API for those columns. This is common when nulls are sparse and you want alignment with the dominant category.
- If you don't have a strong preference and want to keep the AutoML defaults: let AutoML handle it automatically, but plan to inspect the generated best-trial notebook to verify the exact pipeline decisions for your dataset.
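For the first option, here's a minimal PySpark sketch of pre-imputing before handing the data to AutoML; the column names are hypothetical:

```python
# fillna with a string value fills nulls only in string columns; subset
# restricts it to the hypothetical categorical columns named here.
train_df_filled = train_df.fillna(
    "Unknown", subset=["cat_col_1", "cat_col_2", "cat_col_3"]
)

summary = automl.classify(
    dataset=train_df_filled,
    target_col="label",
    timeout_minutes=60,
)
```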
Hope this helps, Louis.