<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How does Databricks AutoML handle null imputation for categorical features by default? in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/how-does-databricks-automl-handle-null-imputation-for/m-p/135805#M4356</link>
    <description>&lt;P&gt;Hi everyone&amp;nbsp;&lt;/P&gt;&lt;P&gt;I’m using &lt;STRONG&gt;Databricks AutoML&lt;/STRONG&gt; (classification workflow) on &lt;STRONG&gt;Databricks Runtime 10.4 LTS ML+&lt;/STRONG&gt;, and I’d like to clarify how &lt;STRONG&gt;missing (null) values&lt;/STRONG&gt; are handled for &lt;STRONG&gt;categorical (string) columns&lt;/STRONG&gt; by default.&lt;/P&gt;&lt;P&gt;From the &lt;A class="" href="https://learn.microsoft.com/en-us/azure/databricks/machine-learning/automl/classification-data-prep?utm_source=chatgpt.com" target="_new" rel="noopener"&gt;AutoML documentation&lt;/A&gt;, I see that:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;“By default, AutoML selects an imputation method based on the column type and content.”&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;and that the imputers parameter allows manual overrides.&lt;BR /&gt;However, the docs don’t specify &lt;STRONG&gt;what the default strategy actually is&lt;/STRONG&gt; for categorical features — e.g.:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Does AutoML create a new category such as "Unknown" or "Missing"?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Or does it impute with the &lt;STRONG&gt;mode/frequent category&lt;/STRONG&gt;?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Or possibly drop rows / rely on the downstream model’s encoder to handle nulls?&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I’d appreciate any clarification from Databricks engineers or users who have inspected the generated AutoML notebooks or pipelines to see what happens in practice.&lt;/P&gt;&lt;P&gt;I’m building a classification model with several categorical variables&lt;BR /&gt;Before running AutoML, I’d like to understand whether I should:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Pre-impute nulls manually (e.g., "Unknown" category), or&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Let AutoML’s internal preprocessing handle them automatically.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Thanks in advance for any pointers or 
references!&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 23 Oct 2025 08:57:58 GMT</pubDate>
    <dc:creator>spearitchmeta</dc:creator>
    <dc:date>2025-10-23T08:57:58Z</dc:date>
    <item>
      <title>How does Databricks AutoML handle null imputation for categorical features by default?</title>
      <link>https://community.databricks.com/t5/machine-learning/how-does-databricks-automl-handle-null-imputation-for/m-p/135805#M4356</link>
      <description>&lt;P&gt;Hi everyone&amp;nbsp;&lt;/P&gt;&lt;P&gt;I’m using &lt;STRONG&gt;Databricks AutoML&lt;/STRONG&gt; (classification workflow) on &lt;STRONG&gt;Databricks Runtime 10.4 LTS ML+&lt;/STRONG&gt;, and I’d like to clarify how &lt;STRONG&gt;missing (null) values&lt;/STRONG&gt; are handled for &lt;STRONG&gt;categorical (string) columns&lt;/STRONG&gt; by default.&lt;/P&gt;&lt;P&gt;From the &lt;A class="" href="https://learn.microsoft.com/en-us/azure/databricks/machine-learning/automl/classification-data-prep?utm_source=chatgpt.com" target="_new" rel="noopener"&gt;AutoML documentation&lt;/A&gt;, I see that:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;“By default, AutoML selects an imputation method based on the column type and content.”&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;and that the imputers parameter allows manual overrides.&lt;BR /&gt;However, the docs don’t specify &lt;STRONG&gt;what the default strategy actually is&lt;/STRONG&gt; for categorical features — e.g.:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Does AutoML create a new category such as "Unknown" or "Missing"?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Or does it impute with the &lt;STRONG&gt;mode/frequent category&lt;/STRONG&gt;?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Or possibly drop rows / rely on the downstream model’s encoder to handle nulls?&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I’d appreciate any clarification from Databricks engineers or users who have inspected the generated AutoML notebooks or pipelines to see what happens in practice.&lt;/P&gt;&lt;P&gt;I’m building a classification model with several categorical variables&lt;BR /&gt;Before running AutoML, I’d like to understand whether I should:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Pre-impute nulls manually (e.g., "Unknown" category), or&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Let AutoML’s internal preprocessing handle them automatically.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Thanks in advance for any pointers or 
references!&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 23 Oct 2025 08:57:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-does-databricks-automl-handle-null-imputation-for/m-p/135805#M4356</guid>
      <dc:creator>spearitchmeta</dc:creator>
      <dc:date>2025-10-23T08:57:58Z</dc:date>
    </item>
    <item>
      <title>Re: How does Databricks AutoML handle null imputation for categorical features by default?</title>
      <link>https://community.databricks.com/t5/machine-learning/how-does-databricks-automl-handle-null-imputation-for/m-p/135891#M4359</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/178472"&gt;@spearitchmeta&lt;/a&gt;&amp;nbsp;, I looked internally to see if I could help with this and I found some information that will shed light on your question.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;DIV class="paragraph"&gt;Here’s how missing (null) values in categorical (string) columns are handled in Databricks AutoML on Databricks Runtime 10.4 LTS ML+, and what I recommend for your classification workflow.&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;What AutoML does by default (DBR 10.4 LTS ML+)&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;By default, &lt;STRONG&gt;AutoML selects an imputation method based on the column type and content&lt;/STRONG&gt;. This applies to classification and regression workflows in DBR 10.4 LTS ML+, and you can override per-column strategies in the UI or API.&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;The &lt;STRONG&gt;API exposes explicit imputation strategies&lt;/STRONG&gt; you can set per column: "mean", "median", "most_frequent", or "constant" with a fill_value. For any column you leave out of &lt;CODE&gt;imputers&lt;/CODE&gt;, AutoML uses its type/content–based default.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;In many AutoML–generated trial notebooks from DBR 10.x, &lt;STRONG&gt;numeric&lt;/STRONG&gt; features are imputed with mean, and &lt;STRONG&gt;categorical&lt;/STRONG&gt; features are primarily handled via one-hot encoding with &lt;CODE&gt;handle_unknown="ignore"&lt;/CODE&gt; (no explicit categorical imputer shown in the standard pipeline). This means nulls aren’t dropped; they flow into the encoder stage.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;In scikit-learn, &lt;STRONG&gt;OneHotEncoder&lt;/STRONG&gt; controls unknown categories via &lt;CODE&gt;handle_unknown&lt;/CODE&gt; (AutoML commonly uses &lt;CODE&gt;ignore&lt;/CODE&gt;). Recent scikit-learn releases (0.24 and later, as far as I can tell) accept missing values in the encoder and treat them as a category of their own when they appear during fit, while categories not seen during fit are encoded as all zeros during transform. If you want a dedicated “Missing/Unknown” level that does not depend on the scikit-learn version, you should pre-fill or explicitly impute a sentinel category before encoding.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
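&lt;DIV class="paragraph"&gt;To see the encoder behavior described above outside of AutoML, here is a minimal sketch using plain scikit-learn. It assumes scikit-learn 0.24 or later; the column values and variable names are made up for illustration, and this is not the exact pipeline AutoML generates.&lt;/DIV&gt;

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy single-column data (illustrative values only).
# The training data contains a null; the scoring data also contains
# a category ("green") that was never seen during fit.
train = np.array([["red"], ["blue"], [None], ["red"]], dtype=object)
score = np.array([["blue"], [None], ["green"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Learned levels: the null seen during fit is kept as its own level.
categories = list(encoder.categories_[0])
print(categories)

# One row per scoring value, one column per learned level.
encoded = encoder.transform(score).toarray()
print(encoded)
```

&lt;DIV class="paragraph"&gt;On the versions I am assuming here, the null row gets its own indicator column because a null was present during fit, while the unseen "green" row comes out as all zeros. If your training data has no nulls in a column, a null at scoring time would fall into the all-zero case instead, which is one more reason to impute a sentinel explicitly.&lt;/DIV&gt;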
&lt;H3 class="paragraph"&gt;Practical implications for categorical nulls&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;If your categorical columns contain nulls and the pipeline uses &lt;STRONG&gt;OneHotEncoder(handle_unknown="ignore")&lt;/STRONG&gt;, they are not dropped; they pass through encoding. Depending on scikit-learn behavior for the version in DBR 10.4, nulls typically end up encoded either as all zeros (treated as unknown) or as their own level if present during fitting. If you need a guaranteed, explicit “Missing” category, it’s best to impute a sentinel string.&lt;/LI&gt;
&lt;LI&gt;You can &lt;STRONG&gt;confirm the exact behavior&lt;/STRONG&gt; on your dataset by opening the generated trial notebook (Best model → “View notebook” in the experiment UI; other trials’ notebooks are stored as MLflow artifacts). The notebook shows the preprocessing pipeline used (imputers, encoders, etc.).&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="paragraph"&gt;How to control imputation explicitly (recommended)&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;If you want categorical nulls to become a fixed label like "Unknown", set the API &lt;CODE&gt;imputers&lt;/CODE&gt; parameter per column to &lt;CODE&gt;{"strategy": "constant", "fill_value": "Unknown"}&lt;/CODE&gt;.&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;For example:&lt;/DIV&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python"&gt;from databricks import automl

# Per-column overrides; columns not listed here keep AutoML's default.
imputers = {
    "cat_col_1": {"strategy": "constant", "fill_value": "Unknown"},
    "cat_col_2": {"strategy": "constant", "fill_value": "Unknown"},
    # numeric example:
    "num_col_1": "mean",
    # mode for a categorical:
    "cat_col_3": "most_frequent",
}

# train_df: the training DataFrame that contains the "label" column.
summary = automl.classify(
    dataset=train_df,
    target_col="label",
    imputers=imputers,
    timeout_minutes=60,
)&lt;/CODE&gt;&lt;/PRE&gt;
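&lt;DIV class="paragraph"&gt;If you have many columns, you may not want to write the &lt;CODE&gt;imputers&lt;/CODE&gt; dictionary by hand. The sketch below builds it from a pandas DataFrame. &lt;CODE&gt;build_imputers&lt;/CODE&gt; is a hypothetical helper, not part of the AutoML API, and the choice of "median" for numeric columns is only an assumption for the example.&lt;/DIV&gt;

```python
import pandas as pd


def build_imputers(df, target_col, fill_value="Unknown"):
    """Hypothetical helper: build an imputers dict for columns that contain nulls."""
    imputers = {}
    for col in df.columns:
        # Skip the target and any column without missing values.
        if col == target_col or not df[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            # Assumed choice for numeric columns in this example.
            imputers[col] = "median"
        else:
            # Non-numeric columns get an explicit sentinel category.
            imputers[col] = {"strategy": "constant", "fill_value": fill_value}
    return imputers


# Small illustrative frame (made-up column names and values).
sample = pd.DataFrame(
    {
        "cat_col_1": ["a", None, "b"],
        "num_col_1": [1.0, None, 3.0],
        "complete_col": ["x", "y", "z"],
        "label": [0, 1, 0],
    }
)

imputers = build_imputers(sample, target_col="label")
print(imputers)
```

&lt;DIV class="paragraph"&gt;The resulting dictionary has the same shape as the hand-written one and can be passed as &lt;CODE&gt;imputers=imputers&lt;/CODE&gt; to &lt;CODE&gt;automl.classify&lt;/CODE&gt;.&lt;/DIV&gt;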
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;Should you pre-impute or let AutoML handle it?&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;Here’s a simple decision guide:&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;If you need a stable, interpretable “Missing” level for business semantics or downstream logic: &lt;STRONG&gt;pre-impute (or set &lt;CODE&gt;imputers&lt;/CODE&gt; to constant)&lt;/STRONG&gt; to "Unknown" or similar so your category space is explicit and reproducible.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;If you prefer data-driven imputation for categoricals: &lt;STRONG&gt;use &lt;CODE&gt;most_frequent&lt;/CODE&gt; (mode)&lt;/STRONG&gt; via the &lt;CODE&gt;imputers&lt;/CODE&gt; API for those columns. This is common when nulls are sparse and you want alignment with the dominant category.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;If you don’t have a strong preference and want to keep the AutoML defaults: &lt;STRONG&gt;let AutoML handle it automatically&lt;/STRONG&gt;, but plan to inspect the generated best-trial notebook to verify the exact pipeline decisions for your dataset.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
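&lt;DIV class="paragraph"&gt;For the pre-impute option, a minimal sketch with pandas could look like this. The DataFrame, the column names, and the "Unknown" label are placeholders; adapt them to your data before passing the result to AutoML.&lt;/DIV&gt;

```python
import pandas as pd

# Illustrative training data with nulls in two categorical columns.
train_pdf = pd.DataFrame(
    {
        "cat_col_1": ["a", None, "b"],
        "cat_col_2": [None, "x", "x"],
        "label": [0, 1, 0],
    }
)

# Columns that should get an explicit sentinel category (assumed list).
categorical_cols = ["cat_col_1", "cat_col_2"]

# Replace nulls with the sentinel before the data reaches AutoML.
train_pdf[categorical_cols] = train_pdf[categorical_cols].fillna("Unknown")

print(train_pdf)
```

&lt;DIV class="paragraph"&gt;Pre-imputing this way makes "Unknown" a regular category, so it shows up the same way in training and at scoring time as long as you apply the same fill to new data.&lt;/DIV&gt;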
&lt;DIV class="paragraph"&gt;Hope this helps, Louis.&lt;/DIV&gt;</description>
      <pubDate>Thu, 23 Oct 2025 19:07:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-does-databricks-automl-handle-null-imputation-for/m-p/135891#M4359</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-10-23T19:07:39Z</dc:date>
    </item>
  </channel>
</rss>

