Register mlflow custom model, which has pickle files

Saeid_H
Contributor

Dear community,

I basically want to store two pickle files during training and register them together with my Keras model, so that when I access the model from another workspace (using mlflow.set_registry_uri()), these files can be accessed as well. The custom MLflow model I am using is as follows:

import joblib
import mlflow


class KerasModel(mlflow.pyfunc.PythonModel):

    def __init__(self, model, tokenizer_path, label_encoder_path):
        self.model = model
        self.tokenizer_path = tokenizer_path
        self.label_encoder_path = label_encoder_path

    def _load_tokenizer(self):
        return joblib.load(self.tokenizer_path)

    def _load_label_encoder(self):
        return joblib.load(self.label_encoder_path)

    def predict(self, context, input_data):
        return self.model.predict(input_data)

and here is my training script:

import joblib
 
import mlflow
import mlflow.keras
import mlflow.tensorflow
 
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelEncoder
 
import keras
import tensorflow
 
# Load and preprocess data into train/test splits
X_train, y_train = get_training_data()
 
########
# do data preprocessing.....
########
 
tokenizer_artifact_path = "/dbfs/tmp/train/tokenizer.pkl"
joblib.dump(fitted_tokenizer, tokenizer_artifact_path)
 
label_encoder_artifact_path = "/dbfs/tmp/train/label_encoder.pkl"
joblib.dump(fitted_label_encoder, label_encoder_artifact_path)
 
 
with mlflow.start_run() as mlflow_run:
 
    # Fit keras model and log model
    ########
    # build keras model.....
    ########
    model_history = model.fit(X_train, y_train)  # fit() trains in place and returns a History object
    mlflow.keras.log_model(model, "model")
    
 
    # log label encoder and tokenizer as artifact
    mlflow.log_artifact(tokenizer_artifact_path)
    mlflow.log_artifact(label_encoder_artifact_path)
         
    # Create a PyFunc model that uses the trained Keras model and label encoder
    pyfunc_model = KerasModel(model, tokenizer_artifact_path, label_encoder_artifact_path)
    mlflow.pyfunc.log_model("custom_model", python_model=pyfunc_model)
    
    # get mlflow artifact uri
    artifact_uri = mlflow_run.info.artifact_uri
    model_uri = artifact_uri + "/custom_model"
 
    # Register model to MLflow Model Registry if provided
    mlflow.set_registry_uri("my_registry_uri")
    mlflow.register_model(model_uri, name="keras_classification")

The problem is that when I want to access this registered model from another workspace, I can load the model but not the pickle files, and it throws the error:

FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/tmp/train/label_encoder.pkl'

I use the following code:

mlflow.set_registry_uri("my_model_registry_uri")
model = mlflow.pyfunc.load_model("model_uri")
unwrapped_model = model.unwrap_python_model()
label_encoder = unwrapped_model._load_label_encoder()
tokenizer = unwrapped_model._load_tokenizer()

It works in the same workspace, since the path is resolvable there, but another workspace has no access to it. My question is: how do I store these two pickle files with the model, so that wherever the model goes, these files go as well?

I have also checked the solution here; unfortunately, I could not understand it completely.

If you could post your answer with code, I would really appreciate it!

With many thanks in advance!

6 REPLIES

Anonymous
Not applicable

@Saeid Hedayati:

To store the pickle files along with the MLflow model, you can include them as artifacts when logging the model. You can modify your training script as follows:

import joblib
 
import mlflow
import mlflow.keras
import mlflow.tensorflow
 
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelEncoder
 
import keras
import tensorflow
 
# Load and preprocess data into train/test splits
X_train, y_train = get_training_data()
 
########
# do data preprocessing.....
########
 
tokenizer_artifact_path = "/dbfs/tmp/train/tokenizer.pkl"
joblib.dump(fitted_tokenizer, tokenizer_artifact_path)
 
label_encoder_artifact_path = "/dbfs/tmp/train/label_encoder.pkl"
joblib.dump(fitted_label_encoder, label_encoder_artifact_path)
 
with mlflow.start_run() as mlflow_run:
 
    # Fit keras model and log model
    ########
    # build keras model.....
    ########
    model_history = model.fit(X_train, y_train)  # fit() trains in place and returns a History object
    mlflow.keras.log_model(model, "model")
    
    # log label encoder and tokenizer as artifacts
    mlflow.log_artifact(tokenizer_artifact_path)
    mlflow.log_artifact(label_encoder_artifact_path)
         
    # Create a PyFunc model that uses the trained Keras model and label encoder
    pyfunc_model = KerasModel(model, tokenizer_artifact_path, label_encoder_artifact_path)
    
    # Log the PyFunc model together with its artifacts
    # (the artifact path comes first; the python model and artifacts are keyword arguments)
    mlflow.pyfunc.log_model("custom_model", python_model=pyfunc_model, artifacts={
        "tokenizer": tokenizer_artifact_path,
        "label_encoder": label_encoder_artifact_path
    })
    
    # get mlflow artifact uri
    artifact_uri = mlflow_run.info.artifact_uri
    model_uri = artifact_uri + "/custom_model"
 
    # Register model to MLflow Model Registry if provided
    mlflow.set_registry_uri("my_registry_uri")
    mlflow.register_model(model_uri, name="keras_classification")

In the above code, the artifacts (i.e., the pickle files) are logged along with the PyFunc model using the mlflow.pyfunc.log_model() method. The artifacts are specified as a dictionary whose keys are the artifact names and whose values are the paths to the artifact files. When the model is later loaded, MLflow downloads these files together with the model and exposes their local paths to the PythonModel through context.artifacts.
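
For the bundled files to be found in another workspace, the PythonModel should read them from context.artifacts rather than from hard-coded DBFS paths. Here is a minimal sketch of a revised KerasModel; the tokenizer and label_encoder attribute names are my choice, not from the original post:

import joblib
import mlflow


class KerasModel(mlflow.pyfunc.PythonModel):

    def __init__(self, model):
        self.model = model

    def load_context(self, context):
        # context.artifacts maps the keys passed to log_model(artifacts=...)
        # to the local paths of the copies bundled with the model
        self.tokenizer = joblib.load(context.artifacts["tokenizer"])
        self.label_encoder = joblib.load(context.artifacts["label_encoder"])

    def predict(self, context, input_data):
        return self.model.predict(input_data)

With this version, only the Keras model itself is passed to __init__; the pickle files are resolved at load time, on whichever machine loads the model.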

To load the model and the artifacts in another workspace, you can use the following code:

import mlflow.pyfunc

# Load the model from the MLflow Model Registry
mlflow.set_registry_uri("my_model_registry_uri")
model = mlflow.pyfunc.load_model(model_uri)  # e.g. "models:/keras_classification/<version>"

# load_context has already run during load_model, so the tokenizer and
# label encoder are available on the unwrapped model (assuming the
# revised KerasModel sketched above)
unwrapped_model = model.unwrap_python_model()
tokenizer = unwrapped_model.tokenizer
label_encoder = unwrapped_model.label_encoder

# Predict on new data through the standard pyfunc interface
y_pred = model.predict(input_data)

In the above code, we load the model, and because the pickle files were bundled with it, load_context restores the tokenizer and label encoder on whatever machine loads the model, with no dependency on the original /dbfs/tmp/train paths.
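
If you only need the raw files rather than the loaded objects, they can also be downloaded straight from the model's artifact location. A sketch, assuming the default pyfunc layout that places each bundled artifact under the model's artifacts/ directory with its original file name (check the model's MLmodel file if the paths differ):

import joblib
from mlflow.artifacts import download_artifacts

# Download the bundled pickle files to the local machine
# (model_uri is the same registry URI used above)
tokenizer_path = download_artifacts(artifact_uri=model_uri + "/artifacts/tokenizer.pkl")
label_encoder_path = download_artifacts(artifact_uri=model_uri + "/artifacts/label_encoder.pkl")

tokenizer = joblib.load(tokenizer_path)
label_encoder = joblib.load(label_encoder_path)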

Hi @Suteja Kanuri,

Thank you for the solution. I have also noticed that, instead of passing the pickle file paths, I can pass the fitted_tokenizer and fitted_label_encoder objects directly to the KerasModel class. That solution worked for me, but yours also looks correct!
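
For completeness, a minimal sketch of that variant: when a PythonModel instance is logged, MLflow serializes the whole object, including its attributes, with cloudpickle, so fitted objects stored as attributes travel with the model (whether the Keras model itself survives this serialization depends on the TensorFlow version, so it is worth testing, as was done here):

import mlflow


class KerasModel(mlflow.pyfunc.PythonModel):

    def __init__(self, model, tokenizer, label_encoder):
        # The fitted objects are stored as attributes and are serialized
        # together with the PythonModel instance when it is logged
        self.model = model
        self.tokenizer = tokenizer
        self.label_encoder = label_encoder

    def predict(self, context, input_data):
        return self.model.predict(input_data)


pyfunc_model = KerasModel(model, fitted_tokenizer, fitted_label_encoder)
mlflow.pyfunc.log_model("custom_model", python_model=pyfunc_model)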

Anonymous
Not applicable

@Saeid Hedayati:

Yes, that's another way to pass the tokenizer and label encoder objects directly to the KerasModel class instead of passing their paths. I'm glad to hear that the solution worked for you! Let me know if you have any other questions.

Thank you @Suteja Kanuri for your support, really appreciated!

Kaniz
Community Manager

Hi @Saeid Hedayati, together we can build a thriving community of shared knowledge and insights. Come back and mark the best answers to contribute to our ongoing pursuit of excellence.
