<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Feature tables &amp; Null Values in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/feature-tables-amp-null-values/m-p/84083#M3604</link>
    <description>&lt;P&gt;Hi!&lt;/P&gt;&lt;P&gt;I was wondering if any of you have ever dealt with feature tables and null values (more specifically, via feature engineering objects rather than the feature store, although I don't think it really matters).&lt;/P&gt;&lt;P&gt;In brief, null values are allowed in feature tables (as long as they aren't in the primary keys, of course), since some models (mainly those from the "tree family") can handle them.&lt;/P&gt;&lt;P&gt;However, the problem I am facing now (my first time with null values in feature tables, to be frank) relates to retrieving the data frame when it's time to train: I can correctly define &lt;STRONG&gt;&lt;EM&gt;training_set_df&lt;/EM&gt;&lt;/STRONG&gt; as:&lt;/P&gt;&lt;PRE&gt;training_set = fe.create_training_set(
  df=label_df,
  feature_lookups=lookups_list,
  label="TARGET",
  exclude_columns=primary_keys
)

training_set_df = training_set.load_df()&lt;/PRE&gt;&lt;P&gt;But that's only the lazy evaluation; if I try to use &lt;STRONG&gt;&lt;EM&gt;training_set_df&lt;/EM&gt;&lt;/STRONG&gt; like:&lt;/P&gt;&lt;PRE&gt;display(
  training_set_df
  .head(3)
)&lt;/PRE&gt;&lt;P&gt;I get the error: &lt;FONT color="#FF0000"&gt;&lt;EM&gt;Some of types cannot be determined after inferring.&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;I tried two alternative solutions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Option 1&lt;/STRONG&gt;: removing from the lookups the fields that contain only null values (within the current set of primary keys; of course I don't have an entire column of nulls in the overall feature table).&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Option 2&lt;/STRONG&gt;: retrieving the schema (&lt;STRONG&gt;&lt;EM&gt;combined_schema&lt;/EM&gt;&lt;/STRONG&gt;) of the features while creating the lookups, and defining &lt;STRONG&gt;&lt;EM&gt;training_set_df&lt;/EM&gt;&lt;/STRONG&gt; as:&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;training_set_df = spark.createDataFrame(
  training_set.load_df().collect(),
  schema=combined_schema
)&lt;/PRE&gt;&lt;P&gt;Neither option worked; I get the same error mentioned above (in red). So, two questions for you:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Why is &lt;A href="https://api-docs.databricks.com/python/feature-engineering/latest/ml_features.training_set.html?highlight=load_df#databricks.ml_features.training_set.TrainingSet.load_df" target="_blank" rel="noopener nofollow noreferrer"&gt;load_df&lt;/A&gt; unable to infer the schema from the feature store, even when the subset selected for training contains all nulls (in one or more columns)? The feature store knows the actual types!&lt;/LI&gt;&lt;LI&gt;How can I solve the problem on my end?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Fri, 23 Aug 2024 15:57:41 GMT</pubDate>
    <dc:creator>__paolo_c__</dc:creator>
    <dc:date>2024-08-23T15:57:41Z</dc:date>
    <item>
      <title>Feature tables &amp; Null Values</title>
      <link>https://community.databricks.com/t5/machine-learning/feature-tables-amp-null-values/m-p/84083#M3604</link>
      <description>&lt;P&gt;Hi!&lt;/P&gt;&lt;P&gt;I was wondering if any of you have ever dealt with feature tables and null values (more specifically, via feature engineering objects rather than the feature store, although I don't think it really matters).&lt;/P&gt;&lt;P&gt;In brief, null values are allowed in feature tables (as long as they aren't in the primary keys, of course), since some models (mainly those from the "tree family") can handle them.&lt;/P&gt;&lt;P&gt;However, the problem I am facing now (my first time with null values in feature tables, to be frank) relates to retrieving the data frame when it's time to train: I can correctly define &lt;STRONG&gt;&lt;EM&gt;training_set_df&lt;/EM&gt;&lt;/STRONG&gt; as:&lt;/P&gt;&lt;PRE&gt;training_set = fe.create_training_set(
  df=label_df,
  feature_lookups=lookups_list,
  label="TARGET",
  exclude_columns=primary_keys
)

training_set_df = training_set.load_df()&lt;/PRE&gt;&lt;P&gt;But that's only the lazy evaluation; if I try to use &lt;STRONG&gt;&lt;EM&gt;training_set_df&lt;/EM&gt;&lt;/STRONG&gt; like:&lt;/P&gt;&lt;PRE&gt;display(
  training_set_df
  .head(3)
)&lt;/PRE&gt;&lt;P&gt;I get the error: &lt;FONT color="#FF0000"&gt;&lt;EM&gt;Some of types cannot be determined after inferring.&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;I tried two alternative solutions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Option 1&lt;/STRONG&gt;: removing from the lookups the fields that contain only null values (within the current set of primary keys; of course I don't have an entire column of nulls in the overall feature table).&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Option 2&lt;/STRONG&gt;: retrieving the schema (&lt;STRONG&gt;&lt;EM&gt;combined_schema&lt;/EM&gt;&lt;/STRONG&gt;) of the features while creating the lookups, and defining &lt;STRONG&gt;&lt;EM&gt;training_set_df&lt;/EM&gt;&lt;/STRONG&gt; as:&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;training_set_df = spark.createDataFrame(
  training_set.load_df().collect(),
  schema=combined_schema
)&lt;/PRE&gt;&lt;P&gt;Neither option worked; I get the same error mentioned above (in red). So, two questions for you:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Why is &lt;A href="https://api-docs.databricks.com/python/feature-engineering/latest/ml_features.training_set.html?highlight=load_df#databricks.ml_features.training_set.TrainingSet.load_df" target="_blank" rel="noopener nofollow noreferrer"&gt;load_df&lt;/A&gt; unable to infer the schema from the feature store, even when the subset selected for training contains all nulls (in one or more columns)? The feature store knows the actual types!&lt;/LI&gt;&lt;LI&gt;How can I solve the problem on my end?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 23 Aug 2024 15:57:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/feature-tables-amp-null-values/m-p/84083#M3604</guid>
      <dc:creator>__paolo_c__</dc:creator>
      <dc:date>2024-08-23T15:57:41Z</dc:date>
    </item>
    <item>
      <title>Re: Feature tables &amp; Null Values</title>
      <link>https://community.databricks.com/t5/machine-learning/feature-tables-amp-null-values/m-p/138823#M4431</link>
      <description>&lt;P&gt;When dealing with feature tables and null values, especially via Databricks Feature Engineering objects (but also more broadly in Spark or other feature platforms), schema inference has some nuanced behaviors. Here are answers to your two questions, with some insight into Spark's and Databricks Feature Engineering's internals.&lt;/P&gt;
&lt;H2&gt;1. Why does &lt;CODE&gt;load_df&lt;/CODE&gt; fail to infer the schema when columns contain only nulls?&lt;/H2&gt;
&lt;P&gt;&lt;STRONG&gt;Root cause&lt;/STRONG&gt;: if a column contains only nulls (at least in your current selection/partition, not necessarily globally), Spark, which underlies Databricks Feature Engineering's DataFrame operations, cannot infer the column's type. Spark's default type inference looks at actual values, and a column of all nulls is typeless in practice. Unless the schema is explicitly provided, such columns end up as &lt;CODE&gt;NullType&lt;/CODE&gt;, which leads to errors like "Some of types cannot be determined after inferring".&lt;/P&gt;
&lt;P&gt;Even if the feature store (or source table) &lt;EM&gt;knows&lt;/EM&gt; the type in its metadata, the lazily evaluated DataFrame produced by &lt;CODE&gt;training_set.load_df()&lt;/CODE&gt; infers types from the physical data pulled into your current selection, which can be all nulls after filtering (such as with your current join/lookup selection).&lt;/P&gt;
&lt;H2 id="2-how-can-you-solve-the-problem-on-your-end" class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0 md:text-lg [hr+&amp;amp;]:mt-4"&gt;2. How can you solve the problem on your end?&lt;/H2&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Recommended Solutions&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;A. Explicitly Provide the Schema&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;When loading your DataFrame, you can explicitly set the schema for all columns, or at least those that might contain only nulls. This overrides Spark’s inference mechanism and “tells” it what type to expect.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;This can be achieved either when materializing the upstream feature table, or by constructing the DataFrame with the exact schema, as with your attempted approach.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Example:&lt;/P&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="translate-y-xs -translate-x-xs bottom-xl mb-xl flex h-0 items-start justify-end md:sticky md:top-[100px]"&gt;
&lt;DIV class="overflow-hidden rounded-full border-subtlest ring-subtlest divide-subtlest bg-base"&gt;
&lt;DIV class="border-subtlest ring-subtlest divide-subtlest bg-subtler"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;combined_schema &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;  &lt;SPAN class="token token"&gt;# Build this from your feature metadata/registry&lt;/SPAN&gt;
df_with_schema &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; spark&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;createDataFrame&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;
    training_set&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;load_df&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;collect&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt;
    schema&lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt;combined_schema
&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;If this still fails, ensure&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;combined_schema&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;faithfully matches the source feature table’s column types (as registered in your feature store, not guessed from the null-containing DataFrame) .&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;B. Fill Missing Columns with Defaults Prior to Inference&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Before using the DataFrame (e.g., before collect or display), fill any all-null columns with a dummy value (appropriate to their type), then cast back if needed:&lt;/P&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="translate-y-xs -translate-x-xs bottom-xl mb-xl flex h-0 items-start justify-end md:sticky md:top-[100px]"&gt;
&lt;DIV class="overflow-hidden rounded-full border-subtlest ring-subtlest divide-subtlest bg-base"&gt;
&lt;DIV class="border-subtlest ring-subtlest divide-subtlest bg-subtler"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;&lt;SPAN class="token token"&gt;# Identify columns of NullType&lt;/SPAN&gt;
inferred_schema &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; training_set&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;load_df&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;schema
null_columns &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token token punctuation"&gt;[&lt;/SPAN&gt;f&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;name &lt;SPAN class="token token"&gt;for&lt;/SPAN&gt; f &lt;SPAN class="token token"&gt;in&lt;/SPAN&gt; inferred_schema&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;fields &lt;SPAN class="token token"&gt;if&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;isinstance&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;f&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;dataType&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; NullType&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;]&lt;/SPAN&gt;
&lt;SPAN class="token token"&gt;for&lt;/SPAN&gt; col_name &lt;SPAN class="token token"&gt;in&lt;/SPAN&gt; null_columns&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
    &lt;SPAN class="token token"&gt;# Replace nulls with a default value, for example 0 for numeric, '' for string&lt;/SPAN&gt;
    training_set_df &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; training_set_df&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;withColumn&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;col_name&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; F&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;lit&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;0&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;cast&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"desired_type"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Once this works, revisit your feature engineering logic so it avoids selecting slices that are known to contain only nulls for given columns, unless that is unavoidable.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;C. Do Not Remove Columns with Nulls Only in Current Selection&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Removing columns from your lookups where only the current slice has all nulls tends to be unreliable, because another partition/slice might have non-nulls, and schema drift might result.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Additional Tips&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;Double-check your feature store's metadata (or table definition) for the expected schema. In Databricks Feature Engineering you can often retrieve this directly, e.g. via the feature table describe/preview in the Databricks UI or via catalog commands.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;If you join multiple sources, make sure data types are aligned. Mismatches (e.g., joining a string column to a numeric one) can also cause inference issues when nulls dominate one side.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;As a stability guard, if your workflow allows, materialize the DataFrame to persistent storage (e.g., save as Parquet with an explicit schema) and reload it.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 id="references" class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0 md:text-lg [hr+&amp;amp;]:mt-4"&gt;References&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Best practices for handling all-null columns and schema inference in Spark .&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Databricks Feature Store schema behavior and handling nullable feature columns .&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;HR /&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;In summary:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;The problem is rooted in Spark’s inability to infer types for all-null columns at runtime, despite metadata being available in the feature store. The fix is to supply the schema explicitly at DataFrame creation, or fill those columns with default values to “nudge” Spark’s inference, using the registered feature types as the ground truth.&lt;/P&gt;</description>
      <pubDate>Wed, 12 Nov 2025 17:04:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/feature-tables-amp-null-values/m-p/138823#M4431</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-11-12T17:04:16Z</dc:date>
    </item>
  </channel>
</rss>

