Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Exclude absent lookup keys from dataframes made by create_training_set()

mrcity
New Contributor II

I've got data stored in feature tables, plus in a data lake. The feature tables are expected to lag the data lake by at least a little bit. I want to filter data coming out of the feature store by querying the data lake for lookup keys out of my index, filtered by one or more properties (such as time, location, or cost center). It is easy enough to do this by providing the result of my query as a dataframe and passing it into `FeatureStoreClient.create_training_set()`.
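
For concreteness, here is a minimal sketch of that setup; the table names (`lake.events`, `ml.features`), the key column `id`, and the label column `label` are hypothetical stand-ins, and `spark` is the ambient Databricks session:

from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

# query the data lake's index for lookup keys, filtered by a property
lookup_df = spark.sql("""
    SELECT id, label
    FROM lake.events
    WHERE event_time >= '2023-01-01'
""")

# join those keys against the feature table to assemble the training set
training_set = fs.create_training_set(
    df=lookup_df,
    feature_lookups=[FeatureLookup(table_name="ml.features", lookup_key="id")],
    label="label",
)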

However, I run into a bit of trouble when trying to train a model where certain lookup keys are not present in the feature table. For instance, if I have keys 1-4 in the lake and write a query that returns 3 and 5, then `create_training_set()` returns 3 next to its features and 5 next to a bunch of NULL values. This causes scikit-learn to choke because the `fit()` function does not filter out bad data prior to training. At least I can mitigate this by running the output of `create_training_set()` through `dropna()` before calling `fit()`, to get rid of any rows containing NULL values.
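
A sketch of that mitigation, continuing from the hypothetical `training_set` above (the choice of estimator is arbitrary):

from sklearn.linear_model import LogisticRegression

# materialize the joined data; keys absent from the feature table
# come back with NULL in every feature column
train_pdf = training_set.load_df().toPandas()

# drop any row containing a NULL so fit() never sees bad data
train_pdf = train_pdf.dropna()

model = LogisticRegression()
model.fit(train_pdf.drop(columns=["id", "label"]), train_pdf["label"])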

Where this becomes truly problematic is running batch inference on an MLflow model. When providing a set of lookup keys to `FeatureStoreClient.score_batch()`, there is no way to intercept the keys on their way to being evaluated, so it fails with `ValueError: Input contains NaN, infinity or a value too large for dtype('float64')`. I don't see an option to force `score_batch()` to remove keys unknown to the feature store, and neither of the workarounds I can conjure up seems particularly clean. The first is to query the data lake directly for model inference and forget about the feature store altogether. The second is to use `create_training_set()` to return only the column for my dependent variable; but since it won't filter out the keys unknown to the feature store ahead of time, I end up having to fetch the entire row from the feature store, run `dropna()`, and then pass only the lookup key column into `score_batch()`. Is there a cleaner way that I didn't think of?
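
For reference, that second workaround looks roughly like the sketch below, reusing the hypothetical names above and assuming a model registered at `models:/my_model/1`:

from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

# candidate keys from the lake; some may be absent from the feature table
candidate_df = spark.sql(
    "SELECT id FROM lake.events WHERE event_time >= '2023-01-01'"
)

# fetch entire feature rows just to learn which keys are known...
full_df = fs.create_training_set(
    df=candidate_df,
    feature_lookups=[FeatureLookup(table_name="ml.features", lookup_key="id")],
    label=None,
).load_df()

# ...drop the NULL rows for unknown keys and keep only the key column...
known_keys_df = full_df.dropna().select("id")

# ...then let score_batch() re-fetch the same features for those keys
predictions = fs.score_batch("models:/my_model/1", known_keys_df)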

Maybe all this is moot because people never run batch inference from a feature store in practice (I would just classify on the raw data myself, personally), but the tester in me thinks it would be nice to give people ways to avoid inadvertently breaking the methods that have been provided. Anyway, thanks for your consideration, and hopefully before long we can at least have an option that prevents the Feature Store APIs from dealing with keys absent from the feature table.

3 REPLIES

Anonymous
Not applicable

@Stephen Wylie:

One approach to handle missing keys during batch inference would be to use a semi join between the lookup keys and the feature table. This keeps only the keys that actually exist in the feature table, so you can filter out the unknown keys before passing them to score_batch() and avoid the issue with NaNs and dropna(). Here's an example of how you could implement this using PySpark:

from databricks.feature_store import FeatureStoreClient

client = FeatureStoreClient()

# define the lookup keys you want to score
lookup_keys = ["key1", "key2", "key3", "key4", "key5"]

# create a DataFrame with the lookup keys
lookup_df = spark.createDataFrame([(key,) for key in lookup_keys], ["lookup_key"])

# read the feature table into a DataFrame
feature_df = spark.read.format("delta").load("path/to/feature/table")

# keep only the keys that exist in the feature table; a left_semi join
# returns the rows of lookup_df that have a match in feature_df, whereas
# a left_outer join would keep unknown keys alongside NULL feature values
score_df = lookup_df.join(feature_df, ["lookup_key"], "left_semi")

# pass the resulting DataFrame of known keys to score_batch()
# (score_batch expects a model URI such as "models:/<name>/<version>")
predictions = client.score_batch("models:/model_name/1", score_df)

This should allow you to filter out unknown keys before passing them to score_batch(), without having to fetch the entire row from the feature store and run dropna().

Anonymous
Not applicable

Hi @Stephen Wylie,

Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

Quinten
New Contributor II

I'm facing the same issue as described by @mrcity. There is no easy way to alter the dataframe that is created inside the score_batch() function. Filtering out rows in the (sklearn) pipeline itself is also inconvenient, since those transformers are typically focused on the features.

The solution described here is quite clean, but it goes against the idea of the 'feature-aware' batch inference provided by FeatureStoreClient(). It is a workable workaround, but in my opinion one should not need to provide the exact feature tables to do this filtering. It would be much better if score_batch() had an option to drop the NULL values from the dataframe before running the model on it.

If there are any other suggestions, I would like to hear them too.
