Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Model flavour using feature store model training log_model()

Edna
New Contributor II

Hi, I have successfully registered my model using the Feature Engineering client with the following code:

with mlflow.start_run():
    # Calculate the ratio of negative class samples to positive class samples
    ratio = (len(y_train) - y_train.sum()) / y_train.sum()

    # Fit model
    xgb_model = xgb.XGBClassifier(scale_pos_weight=ratio)
    xgb_model.fit(X_train, y_train)

    fe.log_model(
        model=xgb_model,
        artifact_path=MODEL_NAME,
        flavor=mlflow.sklearn,
        training_set=training_set,
        registered_model_name=MODEL_NAME,
    )
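
For readers unfamiliar with the ratio in the snippet above: it is the number of negative samples divided by the number of positive samples, which is the value commonly passed to XGBoost's scale_pos_weight for imbalanced data. A minimal pure-Python sketch of the same arithmetic follows; the helper name neg_pos_ratio and the example labels are hypothetical and not part of the original post.

```python
def neg_pos_ratio(labels):
    """Return (#negatives / #positives) for a sequence of 0/1 labels."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0:
        # The ratio is undefined without at least one positive sample
        raise ValueError("no positive samples; ratio is undefined")
    return negatives / positives


# Hypothetical labels: 6 negatives and 2 positives
y_example = [0, 0, 0, 0, 0, 0, 1, 1]
ratio_example = neg_pos_ratio(y_example)
print(ratio_example)  # 3.0
```

With a pandas Series of 0/1 labels, the expression in the post, (len(y_train) - y_train.sum()) / y_train.sum(), computes the same quantity.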

There are two questions:

1. Why is the model still shown as pyfunc in the model registry when the flavor I specified was mlflow.sklearn?

2. Can I use the following code for prediction:

model = mlflow.sklearn.load_model(model_version_uri)

# Predict with model
prob_pred = model.predict_proba(df)[:, 1]

Or must I use score_batch()? I need the predictions to be probabilities rather than 0/1 labels.

Thanks!

#model_flavor #feature_store #score_batch #xgboost #sklearn

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions

robbe
New Contributor III

@Edna unfortunately, it seems that the only way to load a model logged using the Feature Store client to perform batch scoring is by using fe.score_batch(model_uri, df).

If you need the model to predict probabilities, then maybe you can log a custom pyfunc model (a subclass of mlflow.pyfunc.PythonModel, see https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#pyfunc-create-custom) whose predict() function returns the result of model.predict_proba().


4 REPLIES

Kumaran
Valued Contributor III

Hello @Edna 

Thank you for contacting Databricks community support.

MLflow allows you to save models using different "flavors," which are essentially different ways of serializing and deserializing models. When you specify flavor=mlflow.sklearn, you're telling MLflow to save the model using the scikit-learn flavor.

However, when you register the model in the model registry, MLflow will automatically create a pyfunc version of the model in addition to the scikit-learn version. This is because pyfunc is a generic flavor that can be used to load and serve models in a variety of environments, regardless of the flavor used to save the model.

So even though you specified flavor=mlflow.sklearn, the model will still be shown as pyfunc in the model registry. This is expected behavior and allows the model to be easily deployed in a variety of environments.

If you want to deploy the model using the scikit-learn flavor specifically, you can do so by specifying the flavor when you load the model from the registry. For example:

 

import mlflow
import xgboost as xgb

# Load the model using the scikit-learn flavor
model = mlflow.sklearn.load_model(f"models:/{MODEL_NAME}/1")

# Use the model to make predictions
predictions = model.predict(X_test)

In this example, mlflow.sklearn.load_model() is used to load the model using the scikit-learn flavor, even though the model is registered as a pyfunc in the model registry.

 

MiStankai
New Contributor II

Hi @Kumaran! Thank you for this response! Unfortunately, I find that the same approach does not work with a CatBoost model, even though the mlflow.catboost flavour is supported by MLflow. Could you help me with this?
These are the libs I'm using:

%pip install 'catboost==1.2.5' -q 
%pip install 'databricks-feature-engineering==0.6.0' -q
%pip install 'mlflow==2.14.3' -q
%pip install 'shap==0.44.0' -q

I log the model with:

with mlflow.start_run():
    fe.log_model(
        model=model,
        artifact_path='model',
        flavor=mlflow.catboost,
        training_set=training_set,
        registered_model_name=model_uc_name,
        signature=signature,
        input_example=X_train.head(1),
    )

I load it like this:

best_model = mlflow.catboost.load_model(model_uri)

And I get this error:

MlflowException: Model does not have the "catboost" flavor.

And I need to use the FE client to use your cool Feature Lookups. Please help, I'd really appreciate it!

Cheers!!
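
MLflow raises this exception when the flavor requested by the loader is not listed in the logged model's MLmodel file. One way to check which flavors were actually recorded is sketched below, using mlflow.models.get_model_info. The helper names has_flavor and recorded_flavors and the example dictionary are hypothetical, and the snippet assumes the registry URI is already configured so that model_uri resolves.

```python
def has_flavor(flavors, name):
    """Return True if a flavor name is among those recorded for the model."""
    return name in flavors


def recorded_flavors(model_uri):
    """Read the flavors dict from the model's MLmodel metadata.

    mlflow is imported lazily so the pure helper above can be used without it.
    """
    import mlflow

    return mlflow.models.get_model_info(model_uri).flavors


# Hypothetical example: a model whose metadata lists only the generic pyfunc flavor
example_flavors = {"python_function": {}}
print(has_flavor(example_flavors, "catboost"))         # False
print(has_flavor(example_flavors, "python_function"))  # True
```

If has_flavor(recorded_flavors(model_uri), "catboost") comes back False, then mlflow.catboost.load_model(model_uri) cannot succeed for that model version, whatever flavor was passed at logging time.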


Edna
New Contributor II

Thanks for your reply, @robbe. Yes, I have created a custom pyfunc model, which I can now use with fe.score_batch() to return probabilities. Here is the code:

import matplotlib.pyplot as plt
import mlflow
import pandas as pd
import xgboost as xgb
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    RocCurveDisplay,
    precision_score,
    recall_score,
    roc_auc_score,
)

# fe, training_set, MODEL_NAME and the train/test splits are assumed to be defined earlier

# Calculate the ratio of negative class samples to positive class samples
ratio = (len(y_train) - y_train.sum()) / y_train.sum()

# Fit model
xgb_model = xgb.XGBClassifier(scale_pos_weight=ratio, enable_categorical=True)
xgb_model.fit(X_train, y_train)

# Class probabilities, plus hard 0/1 labels at a 0.5 threshold for precision/recall
y_probs = xgb_model.predict_proba(X_test)
y_pred = pd.Series((y_probs[:, 1] > 0.5).astype(int), index=y_test.index)

class churnProbability(mlflow.pyfunc.PythonModel):
    # Wraps the trained classifier so that predict() returns churn probabilities
    def __init__(self, trained_model):
        self.model = trained_model

    def preprocess_result(self, model_input):
        # Placeholder hook: currently returns the input unchanged
        return model_input

    def predict(self, context, model_input):
        processed_df = self.preprocess_result(model_input.copy())
        # Cast to category dtype, as the model was fitted with enable_categorical=True
        processed_df["utility_code"] = processed_df["utility_code"].astype("category")
        processed_df["payment_method_name"] = processed_df["payment_method_name"].astype("category")
        results = self.model.predict_proba(processed_df)
        # Column 1 holds the probability of the positive class
        return results[:, 1]


pyfunc_model = churnProbability(xgb_model)

# End the current MLflow run and start a new one to log the new pyfunc model
mlflow.end_run()

with mlflow.start_run() as run:
    fe.log_model(
        model=pyfunc_model,
        artifact_path=MODEL_NAME,
        flavor=mlflow.pyfunc,
        training_set=training_set,
        registered_model_name=MODEL_NAME,
    )
    # Logging relevant metrics for experiment run comparison and for posterity
    mlflow.log_metrics({
        'Precision Score': precision_score(y_test, y_pred),
        'Recall Score': recall_score(y_test, y_pred),
        # ROC-AUC is computed from the probabilities rather than the thresholded labels
        'ROC-AUC Score': roc_auc_score(y_test, y_probs[:, 1]),
    })

    # Storing artifacts and attaching them to the model run
    # mlflow.log_artifact(metrics_df.to_csv(index=False), "metrics_df.csv")
    f, axes = plt.subplots(1, 2, figsize=(20, 5))
    # plot_confusion_matrix / plot_roc_curve were removed in scikit-learn 1.2;
    # the Display.from_estimator helpers are their replacements
    ConfusionMatrixDisplay.from_estimator(xgb_model, X_test, y_test, ax=axes[0])
    RocCurveDisplay.from_estimator(xgb_model, X_test, y_test, ax=axes[1])
    mlflow.log_figure(f, 'confusion_matrix_roc_curve.png')

plt.close('all')
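
A note on the 'ROC-AUC Score' metric: ROC-AUC measures how well the scores rank positives above negatives, so it is usually computed from predicted probabilities rather than from 0/1 labels thresholded at 0.5. The pure-Python sketch below illustrates the difference on made-up data; the roc_auc helper is a hypothetical pairwise implementation for illustration, not the scikit-learn function.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: every sample scores above 0.5, but positives still rank highest
y_true = [0, 0, 1, 1]
probs = [0.6, 0.7, 0.8, 0.9]
labels = [1 if p > 0.5 else 0 for p in probs]  # all ones after thresholding

auc_from_probs = roc_auc(y_true, probs)    # 1.0: ranking is perfect
auc_from_labels = roc_auc(y_true, labels)  # 0.5: thresholding erased the ranking
print(auc_from_probs, auc_from_labels)
```

In this example the probabilities separate the classes perfectly, while the thresholded labels carry no ranking information at all.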

 
