Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Feature tables & Null Values

__paolo_c__
Contributor II

Hi!

I was wondering if any of you have ever dealt with feature tables and null values (more specifically, via feature engineering objects rather than the feature store, although I don't think it really matters).

In brief, null values are allowed in feature tables (as long as they aren't in the primary keys, of course), since some models (mainly those from the "tree family") can deal with them.

However, the problem I am facing now (first time with null values in feature tables, to be frank) is related to retrieving the DataFrame when training time comes: I can correctly define the training_set_df as:

training_set = fe.create_training_set(
  df=label_df,
  feature_lookups=lookups_list,
  label="TARGET",
  exclude_columns=primary_keys
 )
 
training_set_df = training_set.load_df()

But that's only the lazy evaluation; if I try to actually use training_set_df, like:

display(
  training_set_df
  .head(3)
)

I get the error: Some of types cannot be determined after inferring.

I tried two alternative solutions:

  • Option 1: removing from the lookups the fields that have only null values (within the current set of primary keys; of course, I don't have an entire column of nulls in the overall feature table)
  • Option 2: retrieving the schema (combined_schema) of the features while creating the lookups, and defining training_set_df like:
training_set_df = spark.createDataFrame(
  training_set.load_df().collect(),
  schema=combined_schema
)

Neither of the options above actually worked: I still get the same error mentioned above. So, two questions for you:

  1. Why is load_df not able to infer the schema from the feature store, even when the subset selected for training contains only nulls (in one or more columns)? The feature store knows the actual types!
  2. How can I solve the problem on my end?

Thanks!

1 REPLY

mark_ott
Databricks Employee

When dealing with feature tables and null values—especially via Databricks Feature Engineering objects (but also more broadly in Spark or feature platforms)—there are some nuanced behaviors when schema inference is required. Here are clear answers to your two questions, supported by insights into Spark’s and Databricks Feature Engineering’s internals.

1. Why does load_df fail to infer schema when columns have only NULLs?

Root cause: If a column in the data frame contains only nulls (at least in your current selection/partition, not globally), Spark (which underlies Databricks Feature Engineering’s DataFrame operations) cannot infer the column’s type. This is because Spark’s default type inference looks at actual values, and a column of all nulls is typeless in practice. Unless the schema is explicitly provided or committed as metadata in the upstream feature table, the DataFrame ends up with columns of type NullType, which leads to ambiguous errors like “Some of types cannot be determined after inferring”.

Even if the Feature Store (or source table) knows the type in its metadata, the lazy-evaluated DataFrame produced by training_set.load_df() tries to infer the type based on the physical data pulled into your current partition, which could all be nulls due to filtering (such as with your current join/lookup selection).
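To make this concrete, here is a minimal standalone repro of the same Spark behavior, entirely outside the feature store (the column name is made up for the example): a column that holds only None values gives Spark nothing to infer a type from, so createDataFrame without an explicit schema raises exactly this error.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two rows whose second column is always None: Spark has no value to look at,
# so schema inference fails for that column.
rows = [(1, None), (2, None)]

try:
    spark.createDataFrame(rows, ["id", "all_null_feature"])
except ValueError as e:
    print(e)  # "Some of types cannot be determined after inferring"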

2. How can you solve the problem on your end?

Recommended Solutions

A. Explicitly Provide the Schema

  • When loading your DataFrame, you can explicitly set the schema for all columns, or at least those that might contain only nulls. This overrides Spark’s inference mechanism and “tells” it what type to expect.

  • This can be achieved either when materializing the upstream feature table, or by constructing the DataFrame with the exact schema, as with your attempted approach.

  • Example:

    combined_schema = ...  # build this from your feature metadata/registry

    df_with_schema = spark.createDataFrame(
      training_set.load_df().collect(),
      schema=combined_schema
    )

    If this still fails, ensure combined_schema faithfully matches the source feature table’s column types (as registered in your feature store, not guessed from the null-containing DataFrame); see the sketch below for one way to build it.
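As a rough sketch, combined_schema can be built by hand with explicit Spark types; the field names here are made up for illustration, and the real names and types should come from what is registered in your feature tables, not from the (possibly all-null) data in the current training slice:

from pyspark.sql.types import StructType, StructField, DoubleType, StringType

# Hypothetical schema mirroring the registered feature types
combined_schema = StructType([
    StructField("avg_basket_value", DoubleType(), True),  # a looked-up feature that may be all-null here
    StructField("segment", StringType(), True),           # another looked-up feature
    StructField("TARGET", DoubleType(), True),             # the label column
])

training_set_df = spark.createDataFrame(
    training_set.load_df().collect(),
    schema=combined_schema
)

Keep in mind that collect() pulls the entire training set onto the driver, so this is fine for modest datasets, while the casting approach in option B below tends to scale better.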

B. Fill Missing Columns with Defaults Prior to Inference

  • Before using the DataFrame (e.g., before collect or display), fill any all-null columns with a dummy value (appropriate to their type), then cast back if needed:

    from pyspark.sql import functions as F
    from pyspark.sql.types import NullType

    # Identify columns that were inferred as NullType
    inferred_schema = training_set.load_df().schema
    null_columns = [f.name for f in inferred_schema.fields if isinstance(f.dataType, NullType)]

    for col_name in null_columns:
        # Replace nulls with a default value, for example 0 for numeric, '' for string;
        # swap "desired_type" for the column's real type, e.g. "double"
        training_set_df = training_set_df.withColumn(col_name, F.lit(0).cast("desired_type"))
  • Once you have this working, you can revert/double-check your feature engineering logic to not select partitions that are known to include only nulls for given columns unless unavoidable.
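A variation on the same idea, assuming the feature table lives in Unity Catalog (so it can be read as an ordinary Delta table; the table name below is illustrative): take the authoritative types from the table itself and cast only the columns that came through as NullType, rather than guessing a dummy value.

from pyspark.sql import functions as F
from pyspark.sql.types import NullType

feature_table_name = "my_catalog.my_schema.customer_features"  # illustrative name
registered_types = {f.name: f.dataType for f in spark.table(feature_table_name).schema.fields}

training_set_df = training_set.load_df()
for field in training_set_df.schema.fields:
    if isinstance(field.dataType, NullType) and field.name in registered_types:
        # Cast the all-null column to the type registered in the feature table
        training_set_df = training_set_df.withColumn(
            field.name, F.col(field.name).cast(registered_types[field.name])
        )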

C. Do Not Remove Columns with Nulls Only in Current Selection

  • Removing columns from your lookups where only the current slice has all nulls tends to be unreliable, because another partition/slice might have non-nulls, and schema drift might result.

Additional Tips

  • Double-check your feature store’s metadata (or table definition) for the expected schema. In Databricks Feature Engineering, you can often retrieve this directly via the API (see feature table describe/preview in the Databricks UI or via catalog commands).

  • If you join multiple sources, ensure that data types are aligned. Mismatches (e.g., joining a string and a numeric type) can also induce inference issues when nulls dominate one side.

  • As a stability guard, if your workflow allows, materialize the DataFrame to persistent storage (e.g., save as Parquet with schema) and reload.
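For the first tip, a quick way to check the registered types (assuming a Unity Catalog feature table, which is backed by a regular Delta table; the table name is illustrative):

# Both calls show the column types the feature table actually has registered
spark.table("my_catalog.my_schema.customer_features").printSchema()
spark.sql("DESCRIBE TABLE my_catalog.my_schema.customer_features").show(truncate=False)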



In summary: The problem is rooted in Spark’s inability to infer types for all-null columns at runtime, despite metadata being available in the feature store. The fix is to supply the schema explicitly at DataFrame creation, or fill those columns with default values to “nudge” Spark’s inference, using the registered feature types as the ground truth.