<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Rolling predictions with FeatureEngineeringClient in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/rolling-predictions-with-featureengineeringclient/m-p/138027#M4416</link>
    <pubDate>Thu, 06 Nov 2025 19:07:52 GMT</pubDate>
    <dc:creator>stbjelcevic</dc:creator>
    <dc:date>2025-11-06T19:07:52Z</dc:date>
    <item>
      <title>Rolling predictions with FeatureEngineeringClient</title>
      <link>https://community.databricks.com/t5/machine-learning/rolling-predictions-with-featureengineeringclient/m-p/100260#M3809</link>
      <description>&lt;P&gt;I am performing a time series analysis using an XGBoostRegressor with rolling predictions. I do this with the FeatureEngineeringClient (in combination with Unity Catalog), which I use to create and load my features during training and inference, like so:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Create seasonality features table for revenue
fe.create_table(
  name=fe_table_name_seasonality_revenue,
  primary_keys=["pm_key1","pm_key2","ts_key"],
  timestamp_keys=["ts_key"],
  df=revenue_seasonality_features_sdf,
  description="Revenue Seasonality features",
)&lt;/LI-CODE&gt;&lt;P&gt;One of the features I am creating in the feature store is a lagged feature of my target (y), because I have found that lagged target values are highly predictive. This feature is created as follows:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import Window
from pyspark.sql import functions as F

def create_lag_from_to_spark(df, start, end, var_name='y'):
    '''Create lag features for XGBoost with Spark - make sure the number of lags is not greater than the prediction period'''

    # Create a window specification to partition by pm_key1 and order by ts_key
    window_spec = Window.partitionBy("pm_key1").orderBy("ts_key")

    # Loop over the lag range (start inclusive, end exclusive)
    for lag in range(start, end):
        # Use the lag function to create lag columns
        df = df.withColumn(f'{var_name}_lag_{lag}', F.lag(var_name, lag).over(window_spec))

    return df&lt;/LI-CODE&gt;&lt;P&gt;I am, however, unable to update these lagged features during inference when calling fe.score_batch(). The function called during score_batch() only works for data present in my training data, not for the future values in my inference data, so future dates in my test set are not covered. Ideally, I would edit the score_batch call below so that it recalculates the features after every prediction.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Predict on validation and test set
predictions_sdf = fe.score_batch(
    model_uri=f"models:/{full_model_name}/{model_reference.version}", 
    df=filtered_test_sdf)&lt;/LI-CODE&gt;&lt;P&gt;How can I make rolling predictions, so that each predicted value from score_batch is recursively used to fill in my target and, subsequently, the lag features based on my target? I have looked into UDFs, but based on my testing these also need to be based on data that is already present in the training set and do not support rolling/recursive implementations based on predictions.&lt;/P&gt;</description>
      <pubDate>Wed, 27 Nov 2024 15:46:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/rolling-predictions-with-featureengineeringclient/m-p/100260#M3809</guid>
      <dc:creator>danielvdc</dc:creator>
      <dc:date>2024-11-27T15:46:59Z</dc:date>
    </item>
    <item>
      <title>Re: Rolling predictions with FeatureEngineeringClient</title>
      <link>https://community.databricks.com/t5/machine-learning/rolling-predictions-with-featureengineeringclient/m-p/138027#M4416</link>
      <description>&lt;P class="qt3gz91 paragraph"&gt;You’re running into a fundamental limitation: score_batch does point‑in‑time feature lookups and batch scoring, but it doesn’t support recursive multi‑step forecasting where predictions update features for subsequent timesteps. Feature Store looks up precomputed features “as of” your timestamp, and won’t recalculate lagged target features from predictions inside the same call.&lt;/P&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;What score_batch can and can’t do&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Automatic feature lookup&lt;/STRONG&gt;: When a model is logged with Feature Engineering, score_batch retrieves the features it needs from the offline store and joins them to your input df (by primary and timestamp keys).&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Point‑in‑time correctness&lt;/STRONG&gt;: If you declare a &lt;STRONG&gt;timestamp key&lt;/STRONG&gt; (Workspace Feature Store) or &lt;STRONG&gt;timeseries_columns&lt;/STRONG&gt; (Unity Catalog FE), the join is “as‑of” the timestamp rather than an exact match, so you get the latest feature values at or before that time.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Override behavior&lt;/STRONG&gt;: If you include one or more feature columns directly in df (the dataframe you pass to score_batch), those values are used instead of what’s stored in the Feature Store. This is key to enabling a rolling loop outside score_batch.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;No recursive updates&lt;/STRONG&gt;: score_batch won’t iteratively feed predictions back into the feature computation to update lagged target features for future rows. You must orchestrate that loop yourself.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
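&lt;P class="qt3gz91 paragraph"&gt;To make the “as‑of” behavior concrete, here is a minimal pure‑Python sketch of the lookup semantics described above. It is an illustration only, not how Databricks implements the join; as_of_lookup and the sample rows are hypothetical names and data. It also shows why future rows in the question only ever see the last stored lag value.&lt;/P&gt;

```python
from bisect import bisect_right

def as_of_lookup(feature_rows, ts):
    """Return the value of the latest feature row at or before ts.

    feature_rows: list of (timestamp, value) tuples sorted by timestamp.
    Returns None when no row exists at or before ts.
    """
    timestamps = [row_ts for row_ts, _ in feature_rows]
    idx = bisect_right(timestamps, ts)  # count of rows with timestamp at or before ts
    return feature_rows[idx - 1][1] if idx else None

# Hypothetical stored y_lag_1 values for one entity; timestamps are day numbers
y_lag_1_rows = [(1, 10.0), (2, 12.0), (3, 11.5)]

print(as_of_lookup(y_lag_1_rows, 2))  # 12.0: exact match on day 2
print(as_of_lookup(y_lag_1_rows, 7))  # 11.5: future day falls back to the day-3 row
print(as_of_lookup(y_lag_1_rows, 0))  # None: nothing stored yet
```

&lt;P class="qt3gz91 paragraph"&gt;Every future timestamp resolves to the same stale row, which is why lagged target features have to be supplied explicitly for future steps.&lt;/P&gt;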
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Two viable patterns for rolling predictions&lt;/H3&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;1) Orchestrate a step‑by‑step loop around score_batch (stays inside Feature Store for lookups)&lt;/H4&gt;
&lt;P class="qt3gz91 paragraph"&gt;This pattern uses score_batch each step to get predictions, while you manage the lag features for the next step. It leverages the “override behavior” by passing your computed y_lag_* columns in df, so Feature Store uses those rather than the stored values.&lt;/P&gt;
&lt;P class="qt3gz91 paragraph"&gt;High‑level approach:&lt;/P&gt;
&lt;OL class="qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Build a “future skeleton” df with keys (pm_key1, pm_key2) and future ts_key you want to predict.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Maintain a per‑entity state (e.g., deque of last y values) initialized from historical data to seed y_lag_* for the first future step.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;For each future timestamp t:&lt;/P&gt;
&lt;UL class="qt3gz98 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Construct df_step containing keys and ts_key=t.&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;Add your computed lag columns y_lag_{k} to df_step from the current state.&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;Call fe.score_batch(model_uri, df_step). Because df_step includes y_lag_{k}, those values are used during prediction.&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;Update the per‑entity state by pushing the predicted ŷ(t).&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;Proceed to the next timestamp.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
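&lt;P class="qt3gz91 paragraph"&gt;The steps above can be sketched for a single entity as follows. This is a minimal sketch under stated assumptions: rolling_forecast, score_step, and toy_scorer are hypothetical names, and score_step stands in for building df_step and calling fe.score_batch on it. Only the state handling is shown, so the snippet runs without Spark.&lt;/P&gt;

```python
from collections import deque

def rolling_forecast(history, future_ts, n_lags, score_step):
    """Recursive multi-step forecast for one entity.

    history: past target values, oldest first (at least n_lags values).
    future_ts: timestamps to predict, in order.
    score_step: callable taking one row dict and returning a prediction;
        a stand-in for building df_step and calling fe.score_batch.
    """
    if n_lags > len(history):
        raise ValueError("need at least n_lags historical values")
    state = deque(history[-n_lags:], maxlen=n_lags)  # newest value last
    predictions = []
    for ts in future_ts:
        row = {"ts_key": ts}
        # y_lag_1 is the most recent value, y_lag_2 the one before it, and so on
        for k in range(1, n_lags + 1):
            row[f"y_lag_{k}"] = state[-k]
        y_hat = score_step(row)
        predictions.append((ts, y_hat))
        state.append(y_hat)  # the prediction becomes y_lag_1 for the next step
    return predictions

# Toy scorer (an assumption for illustration): the mean of the two lags
def toy_scorer(row):
    return (row["y_lag_1"] + row["y_lag_2"]) / 2

print(rolling_forecast([10.0, 12.0], [4, 5, 6], n_lags=2, score_step=toy_scorer))
# [(4, 11.0), (5, 11.5), (6, 11.25)]
```

&lt;P class="qt3gz91 paragraph"&gt;In a real job, each step would score all entities at once in one df_step, so the cost is roughly one Spark job per future timestamp.&lt;/P&gt;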
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;2) Do the full recursion inside a grouped Pandas UDF with a directly loaded model (bypasses score_batch)&lt;/H4&gt;
&lt;P class="qt3gz91 paragraph"&gt;If you prefer to avoid repeated Spark jobs per step, you can load the XGBoost model directly (for example via mlflow.xgboost.load_model) and run a stateful loop per entity with applyInPandas, generating predictions and updating lag features row‑by‑row. Caution: Feature Store‑packaged models aren’t meant to be loaded via mlflow.pyfunc for arbitrary predict() calls; use score_batch for FS models or log a second “native” model artifact specifically for programmatic inference.&lt;/P&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Alternative modeling strategies (to avoid recursion)&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Direct multi‑step models&lt;/STRONG&gt;: Train separate horizon‑specific models (t+1, t+2, …) so inference is non‑recursive and compatible with score_batch. This sidesteps the feedback loop into lag features.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Exogenous‑only features&lt;/STRONG&gt;: Use features that are available for future timesteps without needing the target y (e.g., calendars, promotions, covariates). Then score_batch is sufficient with time‑series tables and as‑of joins.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
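&lt;P class="qt3gz91 paragraph"&gt;For the direct multi‑step option, the main change is in how training targets are built: each horizon gets its own target shifted further into the future, while the features only use values observed up to time t. A minimal sketch, with make_direct_training_sets as a hypothetical helper and plain Python lists standing in for the feature tables:&lt;/P&gt;

```python
def make_direct_training_sets(y, n_lags, horizons):
    """Build one list of (features, target) rows per forecast horizon.

    For horizon h, the features are the n_lags values observed up to time t
    and the target is y[t + h], so no prediction is fed back as a feature.
    """
    datasets = {}
    for h in horizons:
        rows = []
        # t is the index of the last observed value used as a feature
        for t in range(n_lags - 1, len(y) - h):
            lags = [y[t - k] for k in range(n_lags)]  # y[t], y[t-1], ...
            rows.append((lags, y[t + h]))
        datasets[h] = rows
    return datasets

series = [1, 2, 3, 4, 5, 6]
sets = make_direct_training_sets(series, n_lags=2, horizons=[1, 2])
print(sets[1][0])  # ([2, 1], 3): first row for the t+1 model
print(sets[2][0])  # ([2, 1], 4): same features, target two steps ahead
```

&lt;P class="qt3gz91 paragraph"&gt;One model is then trained per horizon, and at inference each model is scored once with lag features that are already known, at the cost of maintaining several models.&lt;/P&gt;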
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Key takeaways&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;score_batch can’t be “edited” to recalculate features between predictions; implement a loop around it and pass your lag features in df to override the stored values for each step.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;For a fully stateful recursive approach, use applyInPandas with a natively loaded model (log an additional non‑FS‑packaged artifact if needed).&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Ensure point‑in‑time correctness by using timestamp keys/timeseries_columns in your feature tables, as you’re already doing with timestamp_keys=["ts_key"].&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 06 Nov 2025 19:07:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/rolling-predictions-with-featureengineeringclient/m-p/138027#M4416</guid>
      <dc:creator>stbjelcevic</dc:creator>
      <dc:date>2025-11-06T19:07:52Z</dc:date>
    </item>
  </channel>
</rss>

