Databricks Community

danielvdc · ‎11-27-2024

I am performing a time series analysis, using a XGBoostRegressor with rolling predictions. I am doing so using the FeatureEngineeringClient (in combination with Unity Catalog), where I create and load in my features during training and inference, as such:

# Create seasonality features table for revenue
fe.create_table(
  name=fe_table_name_seasonality_revenue,
  primary_keys=["pm_key1","pm_key2","ts_key"],
  timestamp_keys=["ts_key"],
  df=revenue_seasonality_features_sdf,
  description="Revenue Seasonality features",
)

One of the features I am creating in the feature store is a lagged feature of my target (y). I have found that lagged information regarding my target is highly predictive for my target. This feature is created as such:

def create_lag_from_to_spark(df, var_name='y'):
    '''Create lag features for XGBoost with Spark - make sure amount of lags is not greater than prediction period'''
    
    # Create a window specification to partition by pm_key1 and order by ts_key
    window_spec = Window.partitionBy("pm_key1").orderBy("ts_key")
    
    # Loop over the lag range
    for lag in range(start, end):
        # Use the lag function to create lag columns
        df = df.withColumn(f'{var_name}_lag_{lag}', F.lag(var_name, lag).over(window_spec))
    
    return df

I am however unable to update these lagged features during inference when calling fe.score_batch(). The function called during score_batch() only works for data present in my train data, but not for the future values in my inference data. Future dates in my test set are not considered with this function. Ideally, I would edit the score_batch function below so that it re-calculates the features after every prediction.

# Predict on validation and test set
predictions_sdf = fe.score_batch(
    model_uri=f"models:/{full_model_name}/{model_reference.version}", 
    df=filtered_test_sdf)

How can I ensure rolling predictions, so that the predicted value in score_batch is recursively used to fill in my target and subsequently the lag features based on my target? I have been looking into UDFs, but based on my testing it seems like these also need to be based on data that is already present in the train set and also do not support rolling/recursive implementations based on predictions.

Databricks Community

Rolling predictions with FeatureEngineeringClient

Join Us as a Local Community Builder!

🚀 Weekly Delta (8 - 14 October): A Look Back at This Week’s Top Community Highlights

Databricks Community Champion - September 2025 - Nayanjyoti Sonowal

BrickCon 2025 — Dec 3–5 | A Community Conference for Databricks Builders

🌟 Community Sparks of the Week | September 26 – October 2 🌟

Solution Accelerator Series | #4 - Toxicity Detection for Gaming