TrainingSet schema difference during training and inference
08-12-2024 07:44 AM
Hi,
I'm using the Feature Store to train an ml model and log it using MLflow and FeatureStoreClient(). This model is then used for inference.
I understand the schema of the TrainingSet should not differ between training time and inference time. However, during training, an additional "weight" column is required to guide the model's learning process. These weights are not available during inference time when using score_batch().
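For context, a rough sketch of my flow (table, key, column, and model names below are placeholders, not my actual setup):

from databricks.feature_store import FeatureStoreClient, FeatureLookup
from sklearn.linear_model import LogisticRegression
import mlflow

fs = FeatureStoreClient()

# labels_df: Spark DataFrame with the lookup key and the label (placeholder names)
training_set = fs.create_training_set(
    df=labels_df,
    feature_lookups=[
        FeatureLookup(table_name="my_db.customer_features", lookup_key="customer_id")
    ],
    label="label",
)
train_df = training_set.load_df().toPandas()

model = LogisticRegression()
model.fit(
    train_df.drop(columns=["label"]),
    train_df["label"],
    # sample_weight=train_df["weight"],  # <-- the column I only need at training time
)

# Log the model together with the feature metadata
fs.log_model(
    model,
    artifact_path="model",
    flavor=mlflow.sklearn,
    training_set=training_set,
    registered_model_name="weighted_model",
)

# At inference, batch_df only carries the lookup keys; there is no "weight" column
predictions = fs.score_batch("models:/weighted_model/1", batch_df)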
I'm trying to find a clean work-around for this schema difference, while still using the Feature Store. I tried:
- Including the "weight" column in create_training_set() for training --> Not possible, the column is not available during inference.
- Joining the "weight" column after create_training_set() during training --> Not possible, keys are dropped in the TrainingSet.
- Dropping the "weight" column after create_training_set() --> I can't find a method to drop it completely from the TrainingSet.
Any suggestions?
08-13-2024 03:15 PM - edited 08-13-2024 03:17 PM
Hi @Quinten,
- Create a new feature group with the same schema as your TrainingSet, but with an additional "weight" column.
- During training, join the TrainingSet with the new feature group to add the "weight" column.
- After training, you can drop the "weight" column from the TrainingSet using the drop_columns method provided by the FeatureStoreClient.
- During inference, you can use the original TrainingSet without the "weight" column.
Here's some sample code to illustrate the steps:
# Create a new feature group with the "weight" column
weight_feature_group = fs.create_feature_group(
    name="weight_feature_group",
    table_name="weight_feature_group_table",
    primary_keys=["primary_key_column"],
    schema={
        "primary_key_column": "string",
        "weight": "double"
    }
)

# Join the TrainingSet with the new feature group during training
training_set_with_weight = training_set.join(weight_feature_group, on="primary_key_column")

# Drop the "weight" column from the TrainingSet after training
training_set = training_set.drop_columns(["weight"])

# Use the original TrainingSet without the "weight" column during inference
inference_set = fs.get_historical_features(feature_group_names=["inference_feature_group"])
This approach allows you to keep the schema of the TrainingSet consistent between training and inference time while still using the Feature Store.
08-16-2024 05:23 AM
Thanks for the response, @KumaranT.
Unfortunately, training_set has no attribute 'join'. For that to work, you would first need to load the DataFrame using training_set.load_df(). However, that DataFrame contains no primary keys, so joining on keys is not possible. Or am I missing something?
I created a work-around by joining on the index, but it is not a clean solution.
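For reference, the index join looks roughly like this (column names assumed, and weights_df is just whatever DataFrame holds the weights). It relies on row order, which Spark does not guarantee, hence "not clean":

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Attach a positional index to both DataFrames and join on it
w = Window.orderBy(F.monotonically_increasing_id())
train_df = training_set.load_df().withColumn("row_idx", F.row_number().over(w))
weights_idx = weights_df.withColumn("row_idx", F.row_number().over(w))

# Only works if both DataFrames happen to be in the same row order
train_with_weight = (
    train_df.join(weights_idx.select("row_idx", "weight"), on="row_idx")
    .drop("row_idx")
)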

