I've got data stored in feature tables, plus in a data lake. The feature tables are expected to lag the data lake by at least a little bit. I want to filter the data coming out of the feature store by querying the data lake for lookup keys from my index, filtered by one or more properties (such as time, location, cost center, etc.). It is easy enough to do this by providing the result of my query as a dataframe and passing it into `FeatureStoreClient.create_training_set()`.
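For reference, here is a minimal sketch of what I mean, assuming a Databricks notebook where `spark` is available; the lake table, feature table, column, and key names are placeholders:

```python
from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

# Pull only the lookup keys of interest from the data lake, filtered by some
# properties (table and column names here are hypothetical).
keys_df = spark.sql("""
    SELECT id, label
    FROM lake.events
    WHERE event_time >= '2023-01-01'
      AND cost_center = 'CC-42'
""")

# Hand the filtered keys to the feature store so it can join in the features.
training_set = fs.create_training_set(
    df=keys_df,
    feature_lookups=[
        FeatureLookup(table_name="feature_db.my_features", lookup_key="id"),
    ],
    label="label",
)
training_df = training_set.load_df()
```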
However, I run into a bit of trouble if I am trying to train a model where certain lookup keys are not present in the feature table. For instance, if I have keys 1-4 in the lake and write a query that returns 3 and 5, then `create_training_set()` returns 3 alongside its features and 5 alongside a bunch of NULL values. This causes scikit-learn to choke, because `fit()` does not filter out bad data prior to training. At least I can mitigate this by running the output of `create_training_set()` through `dropna()` before calling `fit()`, which gets rid of any rows containing NULL values.
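Roughly, that mitigation looks like this (again a sketch with placeholder column names, assuming the training set fits in pandas and continuing from the snippet above):

```python
from sklearn.linear_model import LogisticRegression

# Convert the joined training set to pandas and drop any rows whose features
# came back NULL because the lookup key was missing from the feature table.
pdf = training_df.toPandas().dropna()

X = pdf.drop(columns=["id", "label"])  # features only
y = pdf["label"]

model = LogisticRegression()
model.fit(X, y)
```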
Where this becomes truly problematic is running batch inference on an MLflow model. When providing a set of lookup keys to `FeatureStoreClient.score_batch()`, there is no way to intercept the keys on their way to being evaluated, so it fails with `ValueError: Input contains NaN, infinity or a value too large for dtype('float64')`. I don't see an option to force `score_batch()` to drop keys unknown to the feature store, and neither of the workarounds I can conjure up seems particularly clean. The first is to query the data lake directly for model inference and forget about the feature store altogether. The second is to use `create_training_set()` to return only the column for my dependent variable, but since it won't filter out the keys unknown to the feature store ahead of time, I end up having to fetch the entire row from the feature store, run `dropna()`, and then pass only the lookup key column into `score_batch()` (sketched below). Is there any cleaner way that I didn't think of?
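For concreteness, the second workaround looks roughly like this; the feature table, key column, and model URI are placeholders, and I'm only using `create_training_set()` to discover which keys the feature table actually knows about:

```python
# Round-trip the keys through create_training_set() just to learn which ones
# exist in the feature table, then score only those.
lookups = [FeatureLookup(table_name="feature_db.my_features", lookup_key="id")]

full_rows = fs.create_training_set(
    df=keys_df.select("id", "label"),
    feature_lookups=lookups,
    label="label",
).load_df()

# Keep only the keys whose feature lookup did not come back NULL.
known_keys = full_rows.dropna().select("id")

# Placeholder MLflow model URI; score_batch() joins the features back in itself.
predictions = fs.score_batch("models:/my_model/Production", known_keys)
```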
Maybe all this is moot because people never run batch inference from a feature store in practice (I would just classify on the raw data myself, personally), but the tester in me thinks it would be nice to guard against users inadvertently breaking the methods that are provided. Anyway, thanks for your consideration, and hopefully before long we can at least have an option that keeps the Feature Store APIs from choking on keys absent from the feature table.