Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Feature Lookup Help

mharrison
New Contributor II

Hi,

Context

I'm looking for help getting Unity Catalog Feature Lookup to work with my model the way I need it to.

I have a trained darts time series model that takes as input to its `.predict()` method both the history of the variable in question, and `n` the number of time steps to forecast ahead from the end of that history.

So if I have a time series of daily widget sales history up to 30th November and pass `n=7`, I'd get predictions for the period 1st-7th December.

The model is a darts Regression Model that uses some number of past observations of the target variable as features for predicting future values. Darts handles this feature construction itself, though: you just need to provide it the target variable history in the `predict()` call. (At inference time that history will have evolved since training time, but we still want to use the same trained model, just with the updated history.)
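To illustrate the idea of that lag-feature construction, here's a plain-pandas sketch (this is just an illustration of the concept, not darts' actual internals; dates and values are made up):

```python
import pandas as pd

# Hypothetical daily widget sales history (the target variable).
history = pd.Series(
    range(10),
    index=pd.date_range("2024-11-21", periods=10, freq="D"),
    name="widget_sales",
)

# A lag-based regression model turns the last `lags` observations into
# one feature row per date; darts builds features like this internally
# when you pass the series to `predict()`.
lags = 3
features = pd.DataFrame(
    {f"lag_{k}": history.shift(k) for k in range(1, lags + 1)}
).dropna()
```

This is why the model needs the full (updated) history at inference time: the lag features for the first forecast step are read off the tail of that history.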

Databricks

On Databricks, I have the daily widget sales volumes set up as a timeseries Feature Table. And I want to use the Feature Lookup to populate the required history at inference time, that can then be passed to the model.

I wrapped my trained Darts model into an MLflow Pyfunc PythonModel, whose `predict()` method similarly takes a Pandas DataFrame containing the target variable history, and passes that to the underlying Darts model. I then logged this model, along with the Feature Table lookup, using the Databricks Feature Engineering client, so that looking up the history could be handled by the Feature Lookup behind the scenes at inference time.
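The wrapper's shape is roughly the following (a runnable sketch with hypothetical names: in the real code the class subclasses `mlflow.pyfunc.PythonModel` and `self.model` is the trained darts model; here a stub forecaster stands in so the sketch is self-contained):

```python
import pandas as pd

class WidgetForecastWrapper:
    """Sketch of the pyfunc wrapper; subclasses mlflow.pyfunc.PythonModel in practice."""

    def __init__(self, model, horizon=7):
        self.model = model      # trained darts regression model in practice
        self.horizon = horizon  # forecast steps `n`

    def predict(self, context, model_input: pd.DataFrame) -> pd.DataFrame:
        # `model_input` carries the looked-up target history (date, widget_sales);
        # the underlying darts call would be model.predict(n=..., series=history).
        history = model_input.sort_values("date")
        return self.model.predict(self.horizon, history)

class StubModel:
    """Stands in for the darts model: returns `n` flat forecasts
    continuing from the day after the last observed date."""

    def predict(self, n, history):
        last_date = history["date"].iloc[-1]
        dates = pd.date_range(last_date, periods=n + 1, freq="D")[1:]
        last_value = float(history["widget_sales"].iloc[-1])
        return pd.DataFrame({"date": dates, "prediction": [last_value] * n})
```

Note how this already exhibits the shape of the problem below: feed it 30 rows of history and you get back 7 rows of forecast.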

The issue I'm having is that, now that I'm trying to use this model for inference, I have to provide the model with the set of dates I want the feature lookup to run on -- which in this case would be dates in the past. Say I'm running inference on 1st December: I need to pass the dates 1st November-30th November (as the model uses 30 days' past values of widget sales as its features) to the Feature Engineering client's `score_batch()` method.

This will then successfully perform the Feature Table lookup, to get the widget sales volumes for those dates, and pass that to the Darts model. But the predictions produced by the Darts model are for a different set of dates, i.e. 1st-7th December.

So I'm getting the following error when I try this:

pyspark.errors.exceptions.base.PySparkRuntimeError: [RESULT_LENGTH_MISMATCH_FOR_SCALAR_ITER_PANDAS_UDF] The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was 7 and the length of input was 30.

I guess this is because I passed 30 rows to the `score_batch()` method (i.e. the dates I needed for the target value history lookup), and only got back 7 rows as predictions (due to the chosen forecast horizon). But Databricks expects these numbers to be equal, and wants to ascribe each prediction back to a single row of the input to `score_batch()`?

I don't get this issue if I happen to also set the model to only need the same length of history as the desired forecast horizon -- i.e. if the model only uses 7 days' past history to predict 7 days into the future. However, even then there's an issue, since it assigns the 7 predictions to the 7 original input rows, and so the associated dates are incorrect. So the prediction for 1st December is assigned to the input row for 24th November, etc.

But that's not a sustainable solution anyway, since in general the length of history used and forecast horizon will be different.
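The misalignment in that equal-length case is easy to see in plain pandas: if the 7 predictions are simply written back onto the 7 input rows, each forecast inherits a lookup date rather than a forecast date (illustrative values only):

```python
import pandas as pd

# 7 input rows: the dates used for the history lookup (24th-30th November).
inputs = pd.DataFrame({"date": pd.date_range("2024-11-24", periods=7, freq="D")})

# 7 predictions, which actually correspond to 1st-7th December.
predictions = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]

# score_batch-style assignment: prediction k lands on input row k, so the
# 1st December forecast ends up attached to the 24th November row.
result = inputs.assign(prediction=predictions)
```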

So is there a way to get the Feature Lookups to do what I need, which is something like:

* Lookup the widget sales volume history, as-of a particular point in time. (So as-of 1st December, give me all the data up to 30th November inclusive).

* Pass that history to the Darts model's predict method, which will then return projections for the future, from the end date of that history. (I.e. 1st-7th December, if forecast horizon 7).

* Return the predictions, and (if possible) also the history as a separate DataFrame, back to the user when they call `score_batch()`.
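In plain pandas, the flow described in these steps looks roughly like this (hypothetical names; the feature-table read and the darts model are stubbed out so the sketch is self-contained):

```python
import pandas as pd

def forecast_as_of(feature_table: pd.DataFrame, as_of_date, model, horizon=7):
    """As-of lookup + forecast; returns (predictions, history)."""
    # 1. Look up all history strictly before the as-of date
    #    (as-of 1st December -> everything up to 30th November inclusive).
    history = feature_table[feature_table["date"] < as_of_date].sort_values("date")
    # 2. Forecast `horizon` steps from the end of that history.
    predictions = model.predict(horizon, history)
    # 3. Return both, so the caller also gets the history that was used.
    return predictions, history

class StubModel:
    """Stands in for the darts model: flat forecasts from the last value."""

    def predict(self, n, history):
        last_date = history["date"].iloc[-1]
        dates = pd.date_range(last_date, periods=n + 1, freq="D")[1:]
        last_value = float(history["widget_sales"].iloc[-1])
        return pd.DataFrame({"date": dates, "prediction": [last_value] * n})
```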

I considered trying something like table-valued functions to look up the history, but as far as I can tell, the Feature Lookup functionality on Unity Catalog only supports scalar-valued functions (e.g. the docs here say UDTFs can't be registered on Unity Catalog).

2 REPLIES

Walter_C
Databricks Employee

The issue you're encountering arises because the score_batch() method expects the length of the output to match the length of the input. However, your model's forecast horizon (7 days) is shorter than the history length (30 days) you are providing for the feature lookup, leading to a mismatch.

To address this, you can modify your approach to ensure that the lengths match. Here are the steps you can follow:

  1. Prepare the input for the model: Create a DataFrame containing the historical data and the dates for which you want to make predictions (e.g., 1st December to 7th December). This DataFrame should have the same number of rows as the forecast horizon.
  2. Modify the predict() method: Adjust the predict() method of your MLflow Pyfunc PythonModel to handle the input DataFrame correctly. The method should:
    • Extract the historical data from the input DataFrame.
    • Use the historical data to make predictions for the forecast horizon.
    • Return a DataFrame with the predictions, ensuring the number of rows matches the input DataFrame.
  3. Ensure consistent output length: Make sure the output DataFrame from the predict() method has the same number of rows as the input DataFrame passed to score_batch(). This will prevent the length mismatch error.
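A minimal sketch of that suggested shape (hypothetical names; the darts call is stubbed): the input carries one row per forecast date, the history is obtained separately, and the predictions are reindexed onto those forecast dates so that input and output lengths match:

```python
import pandas as pd

def predict(model_input: pd.DataFrame, history: pd.DataFrame, model) -> pd.DataFrame:
    # `model_input` has one row per forecast date (e.g. 1st-7th December),
    # so its length equals the forecast horizon.
    horizon = len(model_input)
    # Forecast from the end of the (separately obtained) history...
    forecast = model.predict(horizon, history)
    # ...and attach the predictions to the forecast dates, keeping
    # output length == input length as score_batch() requires.
    return model_input.assign(prediction=forecast["prediction"].to_numpy())

class StubModel:
    """Stands in for the darts model: flat forecasts from the last value."""

    def predict(self, n, history):
        last_value = float(history["widget_sales"].iloc[-1])
        return pd.DataFrame({"prediction": [last_value] * n})
```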

mharrison
New Contributor II

Thanks for your response. It sounds like the 2nd approach is best for me, modifying the `predict()` method to perform the required history lookup.

Is it possible to do this via the Feature Engineering client within that method, or should I simply query the Unity Catalog directly? I.e. something like:

query = f"""
SELECT date, widget_sales
FROM my_catalog.my_schema.widget_sales
WHERE date < DATE'{as_of_date}'
ORDER BY date
"""
sdf_widget_sales_history = spark.sql(query)
[...]

Which seems straightforward enough, to be fair. Though I guess in this case I shouldn't log the model using the Feature Engineering client, and should instead just log via MLflow directly? The client seems to want either a TrainingSet object or a FeatureSpec, neither of which would be applicable here, as I'm DIY'ing the history lookup.
