Databricks Community

Edna · ‎04-10-2024

Hi, I have successfully registered my model using the Feature Engineering client with the following code:

with mlflow.start_run():
    # Calculate the ratio of negative class samples to positive class samples
    ratio = (len(y_train) - y_train.sum()) / y_train.sum()

    # Fit model
    xgb_model = xgb.XGBClassifier(scale_pos_weight=ratio)
    xgb_model.fit(X_train, y_train)

    fe.log_model(
      model=xgb_model,
      artifact_path=MODEL_NAME,
      flavor=mlflow.sklearn,
      training_set=training_set,
      registered_model_name=MODEL_NAME
    )

There are two questions:

1. Why is the model still shown as pyfunc in the model registry when the flavor I specified was mlflow.sklearn?

2. Can I use the following code for prediction:

model = mlflow.sklearn.load_model(model_version_uri)

# Predict with model
prob_pred = model.predict_proba(df)[:, 1]

or must I use score_batch()? I need the predictions to be probabilities rather than 1/0 labels.

Thanks!

#model_flavor #feature_store #score_batch #xgboost #sklearn

Kumaran · ‎04-30-2024

Hello @Edna

Thank you for contacting Databricks community support.

MLflow allows you to save models using different "flavors," which are essentially different ways of serializing and deserializing models. When you specify flavor=mlflow.sklearn, you're telling MLflow to save the model using the scikit-learn flavor.

However, when you register the model in the model registry, MLflow will automatically create a pyfunc version of the model in addition to the scikit-learn version. This is because pyfunc is a generic flavor that can be used to load and serve models in a variety of environments, regardless of the flavor used to save the model.

So even though you specified flavor=mlflow.sklearn, the model will still be shown as pyfunc in the model registry. This is expected behavior and allows the model to be easily deployed in a variety of environments.

If you want to deploy the model using the scikit-learn flavor specifically, you can do so by specifying the flavor when you load the model from the registry. For example:

import mlflow
import xgboost as xgb

# Load the model using the scikit-learn flavor
model = mlflow.sklearn.load_model(f"models:/{MODEL_NAME}/1")

# Use the model to make predictions
predictions = model.predict(X_test)

In this example, mlflow.sklearn.load_model() is used to load the model using the scikit-learn flavor, even though the model is registered as a pyfunc in the model registry.
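
Before relying on a flavor-specific loader, it may be worth checking which flavors were actually recorded for the logged model. The following is only a minimal sketch: it assumes mlflow.models.get_model_info is available in your MLflow version, and logged_flavors, can_load_with and the example dictionary are hypothetical names made up for illustration.

```python
from typing import Any, Dict


def logged_flavors(model_uri: str) -> Dict[str, Any]:
    """Return the flavors recorded in the logged model's metadata."""
    # Imported lazily so the pure helper below also works without MLflow installed.
    import mlflow

    return mlflow.models.get_model_info(model_uri).flavors


def can_load_with(flavors: Dict[str, Any], flavor_name: str) -> bool:
    """True if the named flavor is listed, i.e. its flavor-specific loader should apply."""
    return flavor_name in flavors


# Hypothetical metadata for a model that only exposes the generic pyfunc flavor.
example_flavors = {"python_function": {"loader_module": "some.loader"}}
print(can_load_with(example_flavors, "python_function"))
print(can_load_with(example_flavors, "catboost"))
```

If the native flavor (for example "sklearn" or "catboost") is not listed for your model URI, the corresponding mlflow.<flavor>.load_model() call should not be expected to work, and the generic pyfunc route is the one to use.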

MiStankai · ‎07-19-2024

Hi @Kumaran! Thank you for this response! Unfortunately, I find that the same approach does not work with a CatBoost model, even though the mlflow.catboost flavor is supported by MLflow. Could you help me with this?
These are the libs I'm using:

%pip install 'catboost==1.2.5' -q 
%pip install 'databricks-feature-engineering==0.6.0' -q 
%pip install 'mlflow==2.14.3' -q 
%pip install 'shap==0.44.0' -q

I log the model with:

with mlflow.start_run():
    fe.log_model(
        model=model,
        artifact_path='model',
        flavor=mlflow.catboost,
        training_set=training_set,
        registered_model_name=model_uc_name,
        signature=signature,
        input_example=X_train.head(1),
    )

I load it like this:

best_model = mlflow.catboost.load_model(model_uri)

And I get this error:

MlflowException: Model does not have the "catboost" flavor.

And I need to use the FE client to use your cool Feature Lookups. Please help, I'd really appreciate it!

Cheers!!

robbe · ‎07-22-2024

@Edna unfortunately, it seems that the only way to load a model logged with the Feature Store client for batch scoring is to use fe.score_batch(model_uri, df).

If you need the model to predict probabilities, you could log a custom pyfunc model, i.e. a subclass of mlflow.pyfunc.PythonModel (https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#pyfunc-create-custom), whose predict() method returns the result of model.predict_proba().

Edna · ‎07-22-2024

Thanks for your reply @robbe - yes, I have created a custom pyfunc model, which I can now use with fe.score_batch() to return probabilities. Here is the code:

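import matplotlib.pyplot as plt
import mlflow
import pandas as pd
import xgboost as xgb
from sklearn.metrics import (ConfusionMatrixDisplay, RocCurveDisplay,
                             precision_score, recall_score, roc_auc_score)

# fe, training_set, MODEL_NAME and the train/test splits are defined earlier in the notebook

# Calculate the ratio of negative class samples to positive class samples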
ratio = (len(y_train) - y_train.sum()) / y_train.sum()

# Fit model
xgb_model = xgb.XGBClassifier(scale_pos_weight=ratio, enable_categorical=True)
xgb_model.fit(X_train, y_train)

y_probs = xgb_model.predict_proba(X_test)
y_pred = pd.Series([1 if prob > 0.5 else 0 for prob in y_probs[:,1]], index=y_test.index)

class churnProbability(mlflow.pyfunc.PythonModel):
    def __init__(self, trained_model):
        self.model = trained_model

    def preprocess_result(self, model_input):
        return model_input

    def predict(self, context, model_input):
        processed_df = self.preprocess_result(model_input.copy())
        processed_df["utility_code"] = processed_df["utility_code"].astype("category")
        processed_df["payment_method_name"] = processed_df["payment_method_name"].astype("category")
        results = self.model.predict_proba(processed_df)
        return results[:,1]


pyfunc_model = churnProbability(xgb_model)

# End the current MLflow run and start a new one to log the new pyfunc model
mlflow.end_run()

with mlflow.start_run() as run:
    fe.log_model(
        model=pyfunc_model,
        artifact_path=MODEL_NAME,
        flavor=mlflow.pyfunc,
        training_set=training_set,
        registered_model_name=MODEL_NAME,
    )
    # Logging relevant metrics for experiment run comparison and for posterity
    # ROC-AUC is computed from the predicted probabilities rather than the thresholded labels
    mlflow.log_metrics({'Precision Score': precision_score(y_test, y_pred),
                        'Recall Score': recall_score(y_test, y_pred),
                        'ROC-AUC Score': roc_auc_score(y_test, y_probs[:, 1])})
    
    # Storing artifacts and attaching them to the model run
    # mlflow.log_artifact(metrics_df.to_csv(index=False), "metrics_df.csv")
    # plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
    # the Display classes from sklearn.metrics replace them.
    f, axes = plt.subplots(1, 2, figsize=(20,5))
    ConfusionMatrixDisplay.from_estimator(xgb_model, X_test, y_test, ax=axes[0])
    RocCurveDisplay.from_estimator(xgb_model, X_test, y_test, ax=axes[1])
    mlflow.log_figure(f, 'confusion_matrix_roc_curve.png')

plt.close('all')
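
For completeness, scoring with the registered wrapper could look roughly like the sketch below. It assumes the client's score_batch accepts model_uri, df and result_type keyword arguments and returns the input DataFrame with an added prediction column. score_probabilities is a hypothetical helper name, and the commented usage lines (including batch_df and the version number) are assumptions about a Databricks notebook, not something taken from the thread.

```python
from typing import Any


def score_probabilities(fe_client: Any, model_uri: str, df: Any) -> Any:
    """Batch-score df with a model logged through the Feature Engineering client.

    Because the wrapped model's predict() returns predict_proba()[:, 1],
    the scored output is expected to hold probabilities, not 0/1 labels.
    """
    # "double" asks for a floating-point prediction column (assumed parameter name).
    return fe_client.score_batch(model_uri=model_uri, df=df, result_type="double")


# Usage in a Databricks notebook (assumed names, not executed here):
# from databricks.feature_engineering import FeatureEngineeringClient
# fe = FeatureEngineeringClient()
# scored = score_probabilities(fe, f"models:/{MODEL_NAME}/1", batch_df)
# scored.select("prediction").show()
```

Passing the client in as an argument keeps the helper easy to exercise with a stub outside Databricks.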

Model flavour using feature store model training log_model()