Cannot log SparkML model to Unity Catalog due to missing output signature

migq2 · 07-12-2024

I am training a Spark ML model (specifically, a SynapseML LightGBM model) in Databricks using MLflow autologging.

When I try to register my model in Unity Catalog, I get the following error:

MlflowException: Model passed for registration contained a signature that includes only inputs. All models in the Unity Catalog must be logged with a model signature containing both input and output type specifications

After some research, I found that the MLflow autologger correctly infers my model's input signature but leaves the output signature empty, and an output signature is required to register the model in UC.

I was able to circumvent this by using the following code to set my signature manually:

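import mlflow
from mlflow.models import ModelSignature

# URI of the model that autolog saved in the active run
model_uri = f"runs:/{mlflow.active_run().info.run_id}/model"
model_info = mlflow.models.get_model_info(model_uri)

# to_dict() stores each schema as a JSON string, so the output is set as a JSON string too
signature_dict = model_info.signature.to_dict()
signature_dict["outputs"] = '[{"type": "double", "name": "prediction", "required": false}]'

# Rebuild the signature and write it back to the logged model
new_signature = ModelSignature.from_dict(signature_dict)
mlflow.models.set_signature(model_uri, new_signature)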

This seems to work but feels hacky and too manual. Is there a way to make the MLflow autologger infer and log the model's output signature correctly, so that this extra manual step is not needed?

Has anyone found a more elegant solution?
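Until autologging fills in the output, the manual step above could at least be wrapped in a small helper. The sketch below is only an illustration: `with_prediction_output` is a hypothetical name, not an MLflow function, and it assumes the dictionary has the shape returned by `ModelSignature.to_dict()`, where each schema is stored as a JSON string. The default column name, type, and `required` flag simply mirror the values used in the workaround.

```python
import json


def with_prediction_output(signature_dict, name="prediction", dtype="double", required=False):
    """Return a copy of a signature dict with a column-based output added if none is set.

    Hypothetical helper: the input is assumed to look like ModelSignature.to_dict(),
    i.e. "inputs" and "outputs" hold JSON strings (or None when missing).
    """
    updated = dict(signature_dict)
    if not updated.get("outputs"):
        updated["outputs"] = json.dumps(
            [{"type": dtype, "name": name, "required": required}]
        )
    return updated


# Stand-in for a signature dict whose output was left empty by autologging
example = {
    "inputs": json.dumps([{"type": "double", "name": "feature_1", "required": True}]),
    "outputs": None,
    "params": None,
}

patched = with_prediction_output(example)
print(patched["outputs"])
```

The patched dictionary would then go through `ModelSignature.from_dict` and `mlflow.models.set_signature` exactly as in the snippet above. A signature that already has outputs is returned unchanged.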

migq2 · 07-15-2024

Hi @Retired_mod, I'm using mlflow-skinny[databricks]==2.14.3 in a Databricks cluster with DBR 13.3 LTS.

I have tried training a model with the following libraries:

Spark MLlib: does not log any signature at all (you can find the snippet to reproduce here)
SynapseML LightGBM: logs an input signature but not an output
scikit-learn: logs a signature with both input and output. However, the output signature is tensor-based, which I thought was meant for deep learning use cases, even though my example is a simple classification model on the iris dataset

Here is the sklearn example:

import mlflow
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

print(f"MLFLOW version is: {mlflow.__version__}\n")

mlflow.autolog(exclusive=False)

with mlflow.start_run():
    # Train a sklearn model on the iris dataset
    X, y = datasets.load_iris(return_X_y=True, as_frame=True)
    clf = RandomForestClassifier(max_depth=7)
    clf.fit(X, y)
    
    model_info = mlflow.models.get_model_info(f"runs:/{mlflow.active_run().info.run_id}/model")
    
    print("Model signature:")
    print(model_info.signature)

Output:

MLFLOW version is: 2.14.3

Model signature:
inputs: 
  ['sepal length (cm)': double (required), 'sepal width (cm)': double (required), 'petal length (cm)': double (required), 'petal width (cm)': double (required)]
outputs: 
  [Tensor('int64', (-1,))]
params: 
  None
