<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Feature tables &amp; Null Values in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/feature-tables-amp-null-values/m-p/84083#M3604</link>
    <description>&lt;P&gt;Hi!&lt;/P&gt;&lt;P&gt;I was wondering if any of you have ever dealt with feature tables and null values (more specifically, via feature engineering objects rather than the feature store, although I don't think it really matters).&lt;/P&gt;&lt;P&gt;In brief, null values are allowed in feature tables (as long as they aren't in the primary keys, of course), since some models (mainly those from the "tree family") can handle them.&lt;/P&gt;&lt;P&gt;However, the problem I am facing now (my first time with null values in feature tables, to be frank) relates to retrieving the data frame when it's time to train: I can correctly define &lt;STRONG&gt;&lt;EM&gt;training_set_df&lt;/EM&gt;&lt;/STRONG&gt; as:&lt;/P&gt;&lt;PRE&gt;training_set = fe.create_training_set(
  df=label_df,
  feature_lookups=lookups_list,
  label="TARGET",
  exclude_columns=primary_keys
)

training_set_df = training_set.load_df()&lt;/PRE&gt;&lt;P&gt;But that's only the lazy evaluation; if I try to use &lt;STRONG&gt;&lt;EM&gt;training_set_df&lt;/EM&gt;&lt;/STRONG&gt; like:&lt;/P&gt;&lt;PRE&gt;display(
  training_set_df
  .head(3)
)&lt;/PRE&gt;&lt;P&gt;I get the error: &lt;FONT color="#FF0000"&gt;&lt;EM&gt;Some of types cannot be determined after inferring.&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;I tried two alternative solutions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Option 1&lt;/STRONG&gt;: removing from the lookups the fields that contain only null values (within the current set of primary keys; of course I don't have an entire column of nulls in the overall feature table).&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Option 2&lt;/STRONG&gt;: retrieving the schema (&lt;STRONG&gt;&lt;EM&gt;combined_schema&lt;/EM&gt;&lt;/STRONG&gt;) of the features while creating the lookups, and defining &lt;STRONG&gt;&lt;EM&gt;training_set_df&lt;/EM&gt;&lt;/STRONG&gt; as:&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;training_set_df = spark.createDataFrame(
  training_set.load_df().collect(),
  schema=combined_schema
)&lt;/PRE&gt;&lt;P&gt;Neither option worked; I get the same error mentioned above (in red). So, two questions for you:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Why is &lt;A href="https://api-docs.databricks.com/python/feature-engineering/latest/ml_features.training_set.html?highlight=load_df#databricks.ml_features.training_set.TrainingSet.load_df" target="_blank" rel="noopener nofollow noreferrer"&gt;load_df&lt;/A&gt; unable to infer the schema from the feature store, even when the subset selected for training contains all nulls (in one or more columns)? The feature store knows the actual types!&lt;/LI&gt;&lt;LI&gt;How can I solve the problem on my end?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Fri, 23 Aug 2024 15:57:41 GMT</pubDate>
    <dc:creator>__paolo_c__</dc:creator>
    <dc:date>2024-08-23T15:57:41Z</dc:date>
    <item>
      <title>Feature tables &amp; Null Values</title>
      <link>https://community.databricks.com/t5/machine-learning/feature-tables-amp-null-values/m-p/84083#M3604</link>
      <description>&lt;P&gt;Hi!&lt;/P&gt;&lt;P&gt;I was wondering if any of you have ever dealt with feature tables and null values (more specifically, via feature engineering objects rather than the feature store, although I don't think it really matters).&lt;/P&gt;&lt;P&gt;In brief, null values are allowed in feature tables (as long as they aren't in the primary keys, of course), since some models (mainly those from the "tree family") can handle them.&lt;/P&gt;&lt;P&gt;However, the problem I am facing now (my first time with null values in feature tables, to be frank) relates to retrieving the data frame when it's time to train: I can correctly define &lt;STRONG&gt;&lt;EM&gt;training_set_df&lt;/EM&gt;&lt;/STRONG&gt; as:&lt;/P&gt;&lt;PRE&gt;training_set = fe.create_training_set(
  df=label_df,
  feature_lookups=lookups_list,
  label="TARGET",
  exclude_columns=primary_keys
)

training_set_df = training_set.load_df()&lt;/PRE&gt;&lt;P&gt;But that's only the lazy evaluation; if I try to use &lt;STRONG&gt;&lt;EM&gt;training_set_df&lt;/EM&gt;&lt;/STRONG&gt; like:&lt;/P&gt;&lt;PRE&gt;display(
  training_set_df
  .head(3)
)&lt;/PRE&gt;&lt;P&gt;I get the error: &lt;FONT color="#FF0000"&gt;&lt;EM&gt;Some of types cannot be determined after inferring.&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;I tried two alternative solutions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Option 1&lt;/STRONG&gt;: removing from the lookups the fields that contain only null values (within the current set of primary keys; of course I don't have an entire column of nulls in the overall feature table).&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Option 2&lt;/STRONG&gt;: retrieving the schema (&lt;STRONG&gt;&lt;EM&gt;combined_schema&lt;/EM&gt;&lt;/STRONG&gt;) of the features while creating the lookups, and defining &lt;STRONG&gt;&lt;EM&gt;training_set_df&lt;/EM&gt;&lt;/STRONG&gt; as:&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;training_set_df = spark.createDataFrame(
  training_set.load_df().collect(),
  schema=combined_schema
)&lt;/PRE&gt;&lt;P&gt;Neither option worked; I get the same error mentioned above (in red). So, two questions for you:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Why is &lt;A href="https://api-docs.databricks.com/python/feature-engineering/latest/ml_features.training_set.html?highlight=load_df#databricks.ml_features.training_set.TrainingSet.load_df" target="_blank" rel="noopener nofollow noreferrer"&gt;load_df&lt;/A&gt; unable to infer the schema from the feature store, even when the subset selected for training contains all nulls (in one or more columns)? The feature store knows the actual types!&lt;/LI&gt;&lt;LI&gt;How can I solve the problem on my end?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 23 Aug 2024 15:57:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/feature-tables-amp-null-values/m-p/84083#M3604</guid>
      <dc:creator>__paolo_c__</dc:creator>
      <dc:date>2024-08-23T15:57:41Z</dc:date>
    </item>
    <item>
      <title>Re: Feature tables &amp; Null Values</title>
      <link>https://community.databricks.com/t5/machine-learning/feature-tables-amp-null-values/m-p/138823#M4431</link>
      <description>&lt;P&gt;When dealing with feature tables and null values, especially via Databricks Feature Engineering objects (but also more broadly in Spark or other feature platforms), schema inference has some nuanced behaviors. Here are answers to your two questions, with some insight into Spark's and Databricks Feature Engineering's internals.&lt;/P&gt;
&lt;H2&gt;1. Why does &lt;CODE&gt;load_df&lt;/CODE&gt; fail to infer the schema when columns contain only nulls?&lt;/H2&gt;
&lt;P&gt;&lt;STRONG&gt;Root cause&lt;/STRONG&gt;: if a column contains only nulls (at least in your current selection/partition, not necessarily globally), Spark, which underlies Databricks Feature Engineering's DataFrame operations, cannot infer the column's type. Spark's default type inference looks at actual values, and a column of all nulls is typeless in practice. Unless the schema is explicitly provided, such columns end up as &lt;CODE&gt;NullType&lt;/CODE&gt;, which leads to errors like "Some of types cannot be determined after inferring".&lt;/P&gt;
&lt;P&gt;Even if the feature store (or source table) &lt;EM&gt;knows&lt;/EM&gt; the type in its metadata, the lazily evaluated DataFrame produced by &lt;CODE&gt;training_set.load_df()&lt;/CODE&gt; infers types from the physical data pulled into your current selection, which can be all nulls after filtering (such as with your current join/lookup selection).&lt;/P&gt;
&lt;H2 id="2-how-can-you-solve-the-problem-on-your-end" class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0 md:text-lg [hr+&amp;amp;]:mt-4"&gt;2. How can you solve the problem on your end?&lt;/H2&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Recommended Solutions&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;A. Explicitly Provide the Schema&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;When loading your DataFrame, you can explicitly set the schema for all columns, or at least those that might contain only nulls. This overrides Spark’s inference mechanism and “tells” it what type to expect.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;This can be achieved either when materializing the upstream feature table, or by constructing the DataFrame with the exact schema, as with your attempted approach.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Example:&lt;/P&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="translate-y-xs -translate-x-xs bottom-xl mb-xl flex h-0 items-start justify-end md:sticky md:top-[100px]"&gt;
&lt;DIV class="overflow-hidden rounded-full border-subtlest ring-subtlest divide-subtlest bg-base"&gt;
&lt;DIV class="border-subtlest ring-subtlest divide-subtlest bg-subtler"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;combined_schema &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;  &lt;SPAN class="token token"&gt;# Build this from your feature metadata/registry&lt;/SPAN&gt;
df_with_schema &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; spark&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;createDataFrame&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;
    training_set&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;load_df&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;collect&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt;
    schema&lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt;combined_schema
&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;If this still fails, ensure&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;combined_schema&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;faithfully matches the source feature table’s column types (as registered in your feature store, not guessed from the null-containing DataFrame) .&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;B. Fill Missing Columns with Defaults Prior to Inference&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Before using the DataFrame (e.g., before collect or display), fill any all-null columns with a dummy value (appropriate to their type), then cast back if needed:&lt;/P&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="translate-y-xs -translate-x-xs bottom-xl mb-xl flex h-0 items-start justify-end md:sticky md:top-[100px]"&gt;
&lt;DIV class="overflow-hidden rounded-full border-subtlest ring-subtlest divide-subtlest bg-base"&gt;
&lt;DIV class="border-subtlest ring-subtlest divide-subtlest bg-subtler"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;&lt;SPAN class="token token"&gt;# Identify columns of NullType&lt;/SPAN&gt;
inferred_schema &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; training_set&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;load_df&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;schema
null_columns &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token token punctuation"&gt;[&lt;/SPAN&gt;f&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;name &lt;SPAN class="token token"&gt;for&lt;/SPAN&gt; f &lt;SPAN class="token token"&gt;in&lt;/SPAN&gt; inferred_schema&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;fields &lt;SPAN class="token token"&gt;if&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;isinstance&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;f&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;dataType&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; NullType&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;]&lt;/SPAN&gt;
&lt;SPAN class="token token"&gt;for&lt;/SPAN&gt; col_name &lt;SPAN class="token token"&gt;in&lt;/SPAN&gt; null_columns&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
    &lt;SPAN class="token token"&gt;# Replace nulls with a default value, for example 0 for numeric, '' for string&lt;/SPAN&gt;
    training_set_df &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; training_set_df&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;withColumn&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;col_name&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; F&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;lit&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;0&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;cast&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"desired_type"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Once this works, revisit your feature engineering logic so it avoids selecting slices that are known to contain only nulls for given columns, unless that is unavoidable.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;C. Do Not Remove Columns with Nulls Only in Current Selection&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Removing columns from your lookups where only the current slice has all nulls tends to be unreliable, because another partition/slice might have non-nulls, and schema drift might result.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Additional Tips&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;Double-check your feature store's metadata (or table definition) for the expected schema. In Databricks Feature Engineering you can often retrieve this directly, e.g. via the feature table describe/preview in the Databricks UI or via catalog commands.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;If you join multiple sources, make sure data types are aligned. Mismatches (e.g., joining a string column to a numeric one) can also cause inference issues when nulls dominate one side.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;As a stability guard, if your workflow allows, materialize the DataFrame to persistent storage (e.g., save as Parquet with an explicit schema) and reload it.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 id="references" class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0 md:text-lg [hr+&amp;amp;]:mt-4"&gt;References&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Best practices for handling all-null columns and schema inference in Spark .&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Databricks Feature Store schema behavior and handling nullable feature columns .&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;HR /&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;In summary:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;The problem is rooted in Spark’s inability to infer types for all-null columns at runtime, despite metadata being available in the feature store. The fix is to supply the schema explicitly at DataFrame creation, or fill those columns with default values to “nudge” Spark’s inference, using the registered feature types as the ground truth.&lt;/P&gt;</description>
      <pubDate>Wed, 12 Nov 2025 17:04:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/feature-tables-amp-null-values/m-p/138823#M4431</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-11-12T17:04:16Z</dc:date>
    </item>
  </channel>
</rss>

