
TrainingSet schema difference during training and inference

Quinten
New Contributor II

Hi,

I'm using the Feature Store to train an ML model and log it using MLflow and FeatureStoreClient(). This model is then used for inference.

I understand the schema of the TrainingSet should not differ between training time and inference time. However, during training, an additional "weight" column is required to guide the model's learning process. These weights are not available during inference time when using score_batch().

I'm trying to find a clean work-around for this schema difference, while still using the Feature Store. I tried:

  1. Including the "weight" column in create_training_set() for training --> Not possible, column not available during inference.
  2. Joining the "weight" column after create_training_set() during training --> Not possible, keys are dropped in the TrainingSet.
  3. Dropping the "weight" column after create_training_set() --> I can't find a method to drop it completely from the TrainingSet.

Any suggestions?

2 REPLIES

KumaranT
New Contributor II

Hi @Quinten,

You can consider creating a custom feature group to store the "weight" column during training. This way, you can keep the schema of the TrainingSet consistent between training and inference time.
Here are the steps you can follow:
  1. Create a new feature group with the same schema as your TrainingSet, but with an additional "weight" column.
  2. During training, join the TrainingSet with the new feature group to add the "weight" column.
  3. After training, you can drop the "weight" column from the TrainingSet using the drop_columns method provided by the FeatureStoreClient.
  4. During inference, you can use the original TrainingSet without the "weight" column.
Here's some sample code to illustrate the steps:

    # Create a new feature group with the "weight" column
    weight_feature_group = fs.create_feature_group(
        name="weight_feature_group",
        table_name="weight_feature_group_table",
        primary_keys=["primary_key_column"],
        schema={
            "primary_key_column": "string",
            "weight": "double"
        }
    )

    # Join the TrainingSet with the new feature group during training
    training_set_with_weight = training_set.join(weight_feature_group, on="primary_key_column")

    # Drop the "weight" column from the TrainingSet after training
    training_set = training_set.drop_columns(["weight"])

    # Use the original TrainingSet without the "weight" column during inference
    inference_set = fs.get_historical_features(feature_group_names=["inference_feature_group"])

This approach allows you to keep the schema of the TrainingSet consistent between training and inference time while still using the Feature Store.

Quinten
New Contributor II

Thanks for the response @KumaranT.

Unfortunately, training_set has no attribute 'join'. For that to work you would first need to load the df using training_set.load_df(). However, this dataframe contains no primary keys, thus joining on keys is not possible. Or am I missing something?

I created a work-around by joining on the index, but it is not a clean solution.
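
One direction that might avoid the index join, offered as an untested sketch rather than a confirmed fix: as far as the FeatureStoreClient API reference indicates, the `label` argument of create_training_set() can be a list of column names, so "weight" could be passed alongside the target and treated as a label-side column instead of a feature. score_batch() does not need label columns, so the inference schema would stay unchanged. Please verify this against the documentation for your runtime. The plain-Python snippet below only illustrates the splitting step after training_set.load_df(); the column names ("f1", "f2", "target", "weight"), the helper `split_training_rows`, and the toy `weighted_mean` "model" are all hypothetical stand-ins.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, float]


def split_training_rows(
    rows: Sequence[Row], label_col: str, weight_col: str
) -> Tuple[List[Row], List[float], List[float]]:
    """Separate feature columns from the label and the per-row weight.

    `rows` stands in for the records of training_set.load_df()
    (assumption: "weight" was passed as an extra label column, so it
    arrives in the same row as the features and no key join is needed).
    """
    features: List[Row] = []
    labels: List[float] = []
    weights: List[float] = []
    for row in rows:
        labels.append(row[label_col])
        weights.append(row[weight_col])
        # Everything that is neither label nor weight is a model input.
        features.append(
            {k: v for k, v in row.items() if k not in (label_col, weight_col)}
        )
    return features, labels, weights


def weighted_mean(labels: Sequence[float], weights: Sequence[float]) -> float:
    """Toy stand-in for a model fit, showing where the weights are consumed."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(y * w for y, w in zip(labels, weights)) / total


# Hypothetical rows, as they might look after load_df() and collection.
rows = [
    {"f1": 0.5, "f2": 1.0, "target": 10.0, "weight": 1.0},
    {"f1": 0.7, "f2": 2.0, "target": 20.0, "weight": 3.0},
]

features, labels, weights = split_training_rows(rows, "target", "weight")
fitted_value = weighted_mean(labels, weights)
```

With a real estimator, the weights would typically go to the training call (for example the `sample_weight` argument that many scikit-learn `fit` methods accept), while only the feature columns define the model's input schema.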
