Cannot log SparkML model to Unity Catalog due to missing output signature
07-12-2024 12:06 PM
I am training a Spark ML model (specifically a SynapseML LightGBM model) in Databricks using MLflow with autologging enabled.
When I try to register my model in Unity Catalog, I get the following error:
MlflowException: Model passed for registration contained a signature that includes only inputs. All models in the Unity Catalog must be logged with a model signature containing both input and output type specifications
After some research, I found that the MLflow autologger correctly infers my model's input signature but leaves the output signature empty, and UC requires an output signature to register the model.
I was able to circumvent this by using the following code to set my signature manually:
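import mlflow
from mlflow.models import ModelSignature
# URI of the model that autolog saved under the active run
model_uri = f"runs:/{mlflow.active_run().info.run_id}/model"
model_info = mlflow.models.get_model_info(model_uri)
# to_dict() stores each schema as a JSON string, so the outputs entry is a JSON string too
signature_dict = model_info.signature.to_dict()
signature_dict["outputs"] = '[{"type": "double", "name": "prediction", "required": false}]'
new_signature = ModelSignature.from_dict(signature_dict)
# Overwrite the signature of the already-logged model
mlflow.models.set_signature(model_uri, new_signature)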
This seems to work but feels hacky and too manual. Is there a way to make mlflow autologger correctly infer and register the model output signature and avoid this additional manual signature setup?
Has anyone found a more elegant solution?
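For reference, the manual patch above can be made a little less error-prone by building the outputs entry with json.dumps instead of a hand-written string. This is only a sketch: add_output_column is a hypothetical helper name, and it assumes the dict has the shape returned by ModelSignature.to_dict(), where "inputs" and "outputs" are JSON-encoded lists of column specs. The example dict is a stand-in for model_info.signature.to_dict(), and the feature name in it is made up.

```python
import json


def add_output_column(signature_dict, name="prediction", dtype="double", required=False):
    """Return a copy of a signature dict with a single-column 'outputs' schema.

    Assumes the shape produced by ModelSignature.to_dict(): 'inputs' and
    'outputs' hold JSON-encoded lists of column specs (or None).
    """
    updated = dict(signature_dict)  # shallow copy so the original is not mutated
    updated["outputs"] = json.dumps(
        [{"type": dtype, "name": name, "required": required}]
    )
    return updated


# Stand-in for model_info.signature.to_dict(); "feature_1" is a hypothetical column.
example = {
    "inputs": json.dumps([{"type": "double", "name": "feature_1", "required": True}]),
    "outputs": None,
}
patched = add_output_column(example)
print(patched["outputs"])
```

The patched dict would then go through ModelSignature.from_dict and mlflow.models.set_signature exactly as in the snippet above. It does not remove the manual step, it only keeps the JSON well-formed.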
07-15-2024 11:21 AM - edited 07-15-2024 11:26 AM
Hi @Retired_mod, I'm using mlflow-skinny[databricks]==2.14.3 in a Databricks cluster with DBR 13.3 LTS.
I have tried training a model with the following libraries:
- Spark MLlib: does not log any signature at all (you can find the snippet to reproduce here)
- SynapseML LightGBM: logs an input signature but not an output signature
- scikit-learn: logs a signature with both input and output. However, the output signature is a tensor-based signature, which I thought was meant for deep learning use cases, even though my example is a simple classification model on the iris dataset
Here goes the sklearn example:
import mlflow
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
print(f"MLFLOW version is: {mlflow.__version__}\n")
mlflow.autolog(exclusive=False)
with mlflow.start_run():
    # Train a sklearn model on the iris dataset
    X, y = datasets.load_iris(return_X_y=True, as_frame=True)
    clf = RandomForestClassifier(max_depth=7)
    clf.fit(X, y)
    model_info = mlflow.models.get_model_info(f"runs:/{mlflow.active_run().info.run_id}/model")
    print("Model signature:")
    print(model_info.signature)
Output:
MLFLOW version is: 2.14.3
Model signature:
inputs:
['sepal length (cm)': double (required), 'sepal width (cm)': double (required), 'petal length (cm)': double (required), 'petal width (cm)': double (required)]
outputs:
[Tensor('int64', (-1,))]
params:
None
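To compare the three cases above against the UC requirement (both inputs and outputs present), a small check on the signature dict may help. This is a sketch under assumptions: missing_signature_parts is a hypothetical helper, the dicts are hand-written stand-ins shaped like ModelSignature.to_dict() rather than real logged signatures, and None represents a model logged with no signature at all.

```python
def missing_signature_parts(signature_dict):
    """List which of 'inputs'/'outputs' are absent from a signature dict.

    Accepts None (no signature logged) or a dict shaped like
    ModelSignature.to_dict(). An empty list means both parts are present.
    """
    if signature_dict is None:
        return ["inputs", "outputs"]
    return [part for part in ("inputs", "outputs") if not signature_dict.get(part)]


# Hand-written stand-ins for the three libraries; the JSON strings are illustrative only.
cases = {
    "spark_mllib": None,
    "synapseml_lightgbm": {
        "inputs": '[{"type": "double", "name": "f0"}]',
        "outputs": None,
    },
    "sklearn": {
        "inputs": '[{"type": "double", "name": "f0"}]',
        "outputs": '[{"type": "tensor", "tensor-spec": {"dtype": "int64", "shape": [-1]}}]',
    },
}
report = {name: missing_signature_parts(sig) for name, sig in cases.items()}
print(report)
```

With a real run, the input would be model_info.signature.to_dict() when model_info.signature is not None.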