I am performing a time series analysis using an XGBRegressor with rolling predictions. I am doing this with the FeatureEngineeringClient (in combination with Unity Catalog), where I create and load my features during training and inference, like so:
# Create seasonality features table for revenue
fe.create_table(
    name=fe_table_name_seasonality_revenue,
    primary_keys=["pm_key1", "pm_key2", "ts_key"],
    timestamp_keys=["ts_key"],
    df=revenue_seasonality_features_sdf,
    description="Revenue Seasonality features",
)
One of the features I create in the feature store is a lagged version of my target (y), since lagged information about the target has proven highly predictive for it. The feature is created as follows:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def create_lag_from_to_spark(df, var_name='y', start=1, end=5):
    '''Create lag features for XGBoost with Spark - make sure the number of lags
    is not greater than the prediction period'''
    # Create a window specification to partition by pm_key1 and order by ts_key
    window_spec = Window.partitionBy("pm_key1").orderBy("ts_key")
    # Loop over the lag range and create one lag column per offset
    for lag in range(start, end):
        df = df.withColumn(f'{var_name}_lag_{lag}', F.lag(var_name, lag).over(window_spec))
    return df
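For clarity, here is the same lag construction in plain pandas (a minimal, hypothetical equivalent; `pm_key1`, `ts_key`, and `y` mirror the Spark column names above):

```python
import pandas as pd

def create_lags_pandas(df, var_name="y", start=1, end=3):
    """Pandas equivalent of the Spark lag builder: one lag column per offset,
    computed within each pm_key1 group after sorting by ts_key."""
    df = df.sort_values(["pm_key1", "ts_key"]).copy()
    for lag in range(start, end):
        df[f"{var_name}_lag_{lag}"] = df.groupby("pm_key1")[var_name].shift(lag)
    return df

frame = pd.DataFrame({
    "pm_key1": ["a", "a", "a", "b", "b"],
    "ts_key":  [1, 2, 3, 1, 2],
    "y":       [10.0, 20.0, 30.0, 5.0, 6.0],
})
lagged = create_lags_pandas(frame, start=1, end=3)
# y_lag_1 within group "a" is [NaN, 10.0, 20.0]
```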
However, I am unable to update these lagged features during inference when calling fe.score_batch(). The feature lookup performed during score_batch() only returns values that already exist in the feature table, i.e. data present at training time; future dates in my test set get no lag values from this function. Ideally, I would adapt the score_batch call below so that the lag features are re-calculated after every prediction.
# Predict on validation and test set
predictions_sdf = fe.score_batch(
    model_uri=f"models:/{full_model_name}/{model_reference.version}",
    df=filtered_test_sdf,
)
How can I implement rolling predictions, so that each value predicted by score_batch is recursively fed back in as the target and used to recompute the target-based lag features? I have been looking into UDFs, but from my testing these also appear to require data that is already present in the training set, and they do not support a rolling/recursive implementation based on predictions.
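To make the intent concrete, this is the kind of recursive loop I would like inference to perform, sketched in plain Python with a placeholder predict_fn standing in for the model (all names here are hypothetical, and this sketch bypasses the feature store entirely):

```python
def rolling_predict(history, future_ts, predict_fn, n_lags=2):
    """Recursive one-step-ahead forecasting: after each prediction the new
    value is appended to the series, so the lag features for the next step
    are built from predicted (not only observed) targets."""
    y = list(history)  # observed target values, oldest first
    preds = []
    for _ in future_ts:
        # Build lag features from the tail of the (growing) series
        lags = {f"y_lag_{i}": y[-i] for i in range(1, n_lags + 1)}
        y_hat = predict_fn(lags)
        preds.append(y_hat)
        y.append(y_hat)  # feed the prediction back in as the next "actual"
    return preds

# Toy model: next value is the mean of the last two values
toy = lambda lags: (lags["y_lag_1"] + lags["y_lag_2"]) / 2
print(rolling_predict([1.0, 3.0], future_ts=[4, 5, 6], predict_fn=toy))
# → [2.0, 2.5, 2.25]
```

The point is the feedback step (`y.append(y_hat)`): each future timestamp's lags are recomputed from a series that now includes earlier predictions, which is exactly what score_batch does not do for me.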