cancel
Showing results for 
Search instead for 
Did you mean: 
Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.
cancel
Showing results for 
Search instead for 
Did you mean: 

Rolling predictions with FeatureEngineeringClient

danielvdc
Visitor

I am performing a time series analysis, using a XGBoostRegressor with rolling predictions. I am doing so using the FeatureEngineeringClient (in combination with Unity Catalog), where I create and load in my features during training and inference, as such:

 

 

# Create seasonality features table for revenue
fe.create_table(
  name=fe_table_name_seasonality_revenue,
  primary_keys=["pm_key1","pm_key2","ts_key"],
  timestamp_keys=["ts_key"],
  df=revenue_seasonality_features_sdf,
  description="Revenue Seasonality features",
)

 

 

One of the features I am creating in the feature store is a lagged feature of my target (y). I have found that lagged information regarding my target is highly predictive for my target. This feature is created as such:

 

 

def create_lag_from_to_spark(df, var_name='y'):
    '''Create lag features for XGBoost with Spark - make sure amount of lags is not greater than prediction period'''
    
    # Create a window specification to partition by pm_key1 and order by ts_key
    window_spec = Window.partitionBy("pm_key1").orderBy("ts_key")
    
    # Loop over the lag range
    for lag in range(start, end):
        # Use the lag function to create lag columns
        df = df.withColumn(f'{var_name}_lag_{lag}', F.lag(var_name, lag).over(window_spec))
    
    return df

 

 

I am however unable to update these lagged features during inference when calling fe.score_batch(). The function called during score_batch() only works for data present in my train data, but not for the future values in my inference data. Future dates in my test set are not considered with this function. Ideally, I would edit the score_batch function below so that it re-calculates the features after every prediction.

 

# Predict on validation and test set
predictions_sdf = fe.score_batch(
    model_uri=f"models:/{full_model_name}/{model_reference.version}", 
    df=filtered_test_sdf)

 

How can I ensure rolling predictions, so that the predicted value in score_batch is recursively used to fill in my target and subsequently the lag features based on my target? I have been looking into UDFs, but based on my testing it seems like these also need to be based on data that is already present in the train set and also do not support rolling/recursive implementations based on predictions.

0 REPLIES 0

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group