
Handling Null Values in Feature Stores

NaeemS
New Contributor III

Hi, I am using multiple feature stores in my workflow via feature lookups. My logged pipeline has several stages: an Assembler, a Standard Scaler, an Indexer, and then the model. However, I am facing an issue during inference with the `score_batch` function.

If an identifier does not have pre-computed values in all the feature stores, the join performed by the feature lookups assigns a null value, and that null is passed straight through to the model by `score_batch`. Is there any way to handle this? I have tried the following approaches so far:

  • Defining a custom transformer as an initial stage of my pipeline to handle such columns. But to use it properly I would have to log this additional code along with my model. MLflow supports this via the code_path parameter, but the feature store's `log_model` method does not expose it.
  • The feature store provides a FeatureFunction mechanism to compute on-demand features, but it is meant for adding extra columns to the resulting dataframe. Can it be leveraged to handle null values in some columns by defining logic in the function that replaces nulls with default values?

 

Thanks.

1 REPLY

Kaniz_Fatma
Community Manager

Hi @NaeemS, handling null values from feature lookups is crucial for keeping your machine learning pipelines robust and reliable.

Let’s explore some strategies to address this issue:

  1. Custom Transformer Stage:

    • You’ve already considered adding an initial custom transformer stage to handle null values. This approach can work, but the extra code has to be available wherever the model is loaded.
    • Unfortunately, the feature store’s log_model method doesn’t currently expose a code_path parameter, which would have been the natural way to package such custom logic.
    • If you proceed this way, document the transformer’s behavior thoroughly and make sure its class is importable at scoring time (a minimal sketch of such a stage appears after this list).
  2. FeatureFunction Method:

    • The FeatureFunction mechanism is primarily designed for computing on-demand features (i.e., adding new columns to your dataframe).
    • However, you can indeed leverage it to handle null values (see the second sketch after this list):
      • Define a Unity Catalog function that takes the relevant column as input.
      • Inside the function, replace null values with an appropriate default (e.g., zero, mean, median, or a sentinel value).
      • Use the function’s output as a new, null-free feature column and exclude the original column from training.
      • This doesn’t replace nulls in the feature store itself, but because the function is recorded with the model, the same fill logic is applied automatically during score_batch.
  3. Imputation Techniques:

    • Consider imputing missing values after the feature-lookup join and before the model sees the data. Common imputation methods include:
      • Mean/Median Imputation: Replace null values with the mean or median of the corresponding feature.
      • Forward/Backward Fill: Propagate the last non-null value forward or the next non-null value backward.
      • Model-Based Imputation: Train a model (e.g., regression) to predict missing values from the other features.
    • Choose a strategy based on the nature of your data and features (a Spark ML Imputer sketch appears after this list).
  4. Model-Specific Handling:

    • Depending on the machine learning model you’re using, some models can handle missing values directly during inference.
    • For example:
      • Gradient-boosted tree libraries such as XGBoost and LightGBM treat NaN as missing natively; classic Spark ML tree models do not.
      • Deep learning models (e.g., neural networks) generally require imputed inputs.
    • Also note that in a Spark ML pipeline, VectorAssembler fails on nulls by default; you would need handleInvalid="keep" (which emits NaN) for a NaN-tolerant model to receive them at all.
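
For point 1, here is a minimal sketch of such a transformer stage, assuming numeric feature columns with fixed defaults (the class and column names are hypothetical, not part of any Databricks API):

```python
# Hypothetical sketch: a pipeline stage that fills nulls left by feature
# lookups before they reach the assembler, scaler, and model.
from pyspark.ml import Transformer
from pyspark.sql import DataFrame


class NullDefaultFiller(Transformer):
    """Replaces nulls in selected columns with fixed default values."""

    def __init__(self, defaults=None):
        super().__init__()
        # Per-column defaults, e.g. {"feature_a": 0.0, "feature_b": -1.0}
        self.defaults = defaults or {}

    def _transform(self, df: DataFrame) -> DataFrame:
        return df.fillna(self.defaults)
```

Note that for this stage to survive PipelineModel persistence it would also need the DefaultParamsReadable/DefaultParamsWritable mixins (with the defaults stored as a Param), and the class must be importable when the model is loaded, which is exactly why the missing code_path option is painful here.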
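For point 2, a hedged sketch of the FeatureFunction route, using a Unity Catalog Python UDF to coalesce nulls at lookup time. The catalog, schema, table, and column names below are placeholders, and label_df is assumed to be a DataFrame containing the lookup key and the label:

```python
# Sketch only: all catalog/schema/table/column names are hypothetical.
from databricks.feature_engineering import (
    FeatureEngineeringClient,
    FeatureFunction,
    FeatureLookup,
)

# A Unity Catalog Python UDF that falls back to a default when the
# looked-up feature is null. (`spark` is the notebook's SparkSession.)
spark.sql("""
CREATE OR REPLACE FUNCTION main.default.fill_null_double(raw_value DOUBLE)
RETURNS DOUBLE
LANGUAGE PYTHON
AS $$
return raw_value if raw_value is not None else 0.0
$$
""")

fe = FeatureEngineeringClient()
features = [
    FeatureLookup(
        table_name="main.default.customer_features",
        lookup_key="customer_id",
        feature_names=["feature_a"],
    ),
    # Computes a null-free copy of feature_a at both training and
    # score_batch time, since the function is recorded with the model.
    FeatureFunction(
        udf_name="main.default.fill_null_double",
        input_bindings={"raw_value": "feature_a"},
        output_name="feature_a_filled",
    ),
]

training_set = fe.create_training_set(
    df=label_df,
    feature_lookups=features,
    label="label",
    exclude_columns=["feature_a"],  # train on the filled column only
)
```

Because the FeatureFunction is part of the training set specification, logging the model through the feature store records it, and score_batch evaluates the same UDF at inference, so the model never sees the raw null.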
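And for point 3, a minimal sketch of mean imputation with Spark ML's Imputer, fitted on the training data so the same statistics are reused at scoring time (column names are again hypothetical):

```python
# Sketch: learn per-column means on training data, then fill nulls/NaNs
# in the joined inference DataFrame with those same means.
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=["feature_a", "feature_b"],
    outputCols=["feature_a_imp", "feature_b_imp"],
    strategy="mean",  # "median" and "mode" are also supported
)

imputer_model = imputer.fit(train_df)       # train_df: your training data
inference_ready = imputer_model.transform(joined_df)  # nulls -> column mean
```

Since the fitted ImputerModel is a standard Spark ML stage, it can also be placed inside the pipeline itself, ahead of the assembler, without any custom code to log.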
