Register mlflow custom model, which has pickle files

Saeid_H
Contributor

Dear community,

I basically want to store two pickle files during training and register them together with my Keras model, so that when I access the model from another workspace (using mlflow.set_registry_uri()), these files can be accessed as well. The custom MLflow model I am using is as follows:

import joblib
import mlflow


class KerasModel(mlflow.pyfunc.PythonModel):

    def __init__(self, model, tokenizer_path, label_encoder_path):
        self.model = model
        self.tokenizer_path = tokenizer_path
        self.label_encoder_path = label_encoder_path

    def _load_tokenizer(self):
        return joblib.load(self.tokenizer_path)

    def _load_label_encoder(self):
        return joblib.load(self.label_encoder_path)

    def predict(self, context, input_data):
        return self.model.predict(input_data)

and here is my training script:

import joblib
 
import mlflow
import mlflow.keras
import mlflow.tensorflow
 
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelEncoder
 
import keras
import tensorflow
 
# Load and preprocess data into train/test splits
X_train, y_train = get_training_data()
 
########
# do data preprocessing.....
########
 
tokenizer_artifact_path = "/dbfs/tmp/train/tokenizer.pkl"
joblib.dump(fitted_tokenizer, tokenizer_artifact_path)
 
label_encoder_artifact_path = "/dbfs/tmp/train/label_encoder.pkl"
joblib.dump(fitted_label_encoder, label_encoder_artifact_path)
 
 
with mlflow.start_run() as mlflow_run:
 
    # Fit keras model and log model
    ########
    # build keras model.....
    ########
    model_history = model.fit(X_train, y_train)  # fit() trains in place and returns a History object
    mlflow.keras.log_model(model, "model")
    
 
    # log label encoder and tokenizer as artifact
    mlflow.log_artifact(tokenizer_artifact_path)
    mlflow.log_artifact(label_encoder_artifact_path)
         
    # Create a PyFunc model that uses the trained Keras model and label encoder
    pyfunc_model = KerasModel(model, tokenizer_artifact_path, label_encoder_artifact_path)
    mlflow.pyfunc.log_model("custom_model", python_model=pyfunc_model)
    
    # get mlflow artifact uri
    artifact_uri = mlflow_run.info.artifact_uri
    model_uri = artifact_uri + "/custom_model"
 
    # Register model to MLflow Model Registry if provided
    mlflow.set_registry_uri("my_registry_uri")
    mlflow.register_model(model_uri, name="keras_classification")

The problem is that when I want to access this registered model from another workspace, I can load the model but not the pickle files, and it throws the error:

FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/tmp/train/label_encoder.pkl'

I use the following code:

mlflow.set_registry_uri("my_model_registry_uri")
model = mlflow.pyfunc.load_model("model_uri")
unwrapped_model = model.unwrap_python_model()
label_encoder = unwrapped_model._load_label_encoder()
tokenizer = unwrapped_model._load_tokenizer()

It works in the same workspace, since the path is resolvable there, but another workspace has no access to it. My question is: how do I store these two pickle files with the model, so that wherever the model goes, these files go as well?

I have also checked the solution here; unfortunately, I could not understand it completely.

If you could post your answer with code, I would really appreciate it!

With many thanks in advance!

6 REPLIES

Anonymous
Not applicable

@Saeid Hedayati:

To store the pickle files along with the MLflow model, you can include them as artifacts when logging the model. You can modify your training script as follows:

import joblib
 
import mlflow
import mlflow.keras
import mlflow.tensorflow
 
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelEncoder
 
import keras
import tensorflow
 
# Load and preprocess data into train/test splits
X_train, y_train = get_training_data()
 
########
# do data preprocessing.....
########
 
tokenizer_artifact_path = "/dbfs/tmp/train/tokenizer.pkl"
joblib.dump(fitted_tokenizer, tokenizer_artifact_path)
 
label_encoder_artifact_path = "/dbfs/tmp/train/label_encoder.pkl"
joblib.dump(fitted_label_encoder, label_encoder_artifact_path)
 
with mlflow.start_run() as mlflow_run:
 
    # Fit keras model and log model
    ########
    # build keras model.....
    ########
    model_history = model.fit(X_train, y_train)  # fit() trains in place and returns a History object
    mlflow.keras.log_model(model, "model")
    
    # log label encoder and tokenizer as artifacts
    mlflow.log_artifact(tokenizer_artifact_path)
    mlflow.log_artifact(label_encoder_artifact_path)
         
    # Create a PyFunc model that uses the trained Keras model and label encoder
    pyfunc_model = KerasModel(model, tokenizer_artifact_path, label_encoder_artifact_path)
    
    # Log the PyFunc model together with its artifacts
    # (the artifact path comes first; the python model and artifacts are keyword arguments)
    mlflow.pyfunc.log_model("custom_model", python_model=pyfunc_model, artifacts={
        "tokenizer": tokenizer_artifact_path,
        "label_encoder": label_encoder_artifact_path
    })
    
    # get mlflow artifact uri
    artifact_uri = mlflow_run.info.artifact_uri
    model_uri = artifact_uri + "/custom_model"
 
    # Register model to MLflow Model Registry if provided
    mlflow.set_registry_uri("my_registry_uri")
    mlflow.register_model(model_uri, name="keras_classification")

In the above code, the artifacts (i.e., the pickle files) are logged along with the PyFunc model using the mlflow.pyfunc.log_model() method. The artifacts are specified as a dictionary whose keys are the artifact names and whose values are the paths to the artifact files. When the model is later loaded, MLflow downloads these files together with the model and exposes their local paths to the PythonModel through context.artifacts.
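
For the bundled files to be found in another workspace, the PythonModel should read them from context.artifacts rather than from hard-coded DBFS paths. Here is a minimal sketch of a revised KerasModel; the tokenizer and label_encoder attribute names are my choice, not from the original post:

import joblib
import mlflow


class KerasModel(mlflow.pyfunc.PythonModel):

    def __init__(self, model):
        self.model = model

    def load_context(self, context):
        # context.artifacts maps the keys passed to log_model(artifacts=...)
        # to the local paths of the copies bundled with the model
        self.tokenizer = joblib.load(context.artifacts["tokenizer"])
        self.label_encoder = joblib.load(context.artifacts["label_encoder"])

    def predict(self, context, input_data):
        return self.model.predict(input_data)

With this version, only the Keras model itself is passed to __init__; the pickle files are resolved at load time, on whichever machine loads the model.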

To load the model and the artifacts in another workspace, you can use the following code:

import mlflow.pyfunc

# Load the model from the MLflow Model Registry
mlflow.set_registry_uri("my_model_registry_uri")
model = mlflow.pyfunc.load_model(model_uri)  # e.g. "models:/keras_classification/<version>"

# load_context has already run during load_model, so the tokenizer and
# label encoder are available on the unwrapped model (assuming the
# revised KerasModel sketched above)
unwrapped_model = model.unwrap_python_model()
tokenizer = unwrapped_model.tokenizer
label_encoder = unwrapped_model.label_encoder

# Predict on new data through the standard pyfunc interface
y_pred = model.predict(input_data)

In the above code, we load the model, and because the pickle files were bundled with it, load_context restores the tokenizer and label encoder on whatever machine loads the model, with no dependency on the original /dbfs/tmp/train paths.
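
If you only need the raw files rather than the loaded objects, they can also be downloaded straight from the model's artifact location. A sketch, assuming the default pyfunc layout that places each bundled artifact under the model's artifacts/ directory with its original file name (check the model's MLmodel file if the paths differ):

import joblib
from mlflow.artifacts import download_artifacts

# Download the bundled pickle files to the local machine
# (model_uri is the same registry URI used above)
tokenizer_path = download_artifacts(artifact_uri=model_uri + "/artifacts/tokenizer.pkl")
label_encoder_path = download_artifacts(artifact_uri=model_uri + "/artifacts/label_encoder.pkl")

tokenizer = joblib.load(tokenizer_path)
label_encoder = joblib.load(label_encoder_path)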

Hi @Suteja Kanuri,

Thank you for the solution. I have also noticed that, instead of passing the pickle file paths, I can pass the fitted_tokenizer and fitted_label_encoder objects directly to the KerasModel class. That solution worked for me, but yours also looks correct!
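
For completeness, a minimal sketch of that variant: when a PythonModel instance is logged, MLflow serializes the whole object, including its attributes, with cloudpickle, so fitted objects stored as attributes travel with the model (whether the Keras model itself survives this serialization depends on the TensorFlow version, so it is worth testing, as was done here):

import mlflow


class KerasModel(mlflow.pyfunc.PythonModel):

    def __init__(self, model, tokenizer, label_encoder):
        # The fitted objects are stored as attributes and are serialized
        # together with the PythonModel instance when it is logged
        self.model = model
        self.tokenizer = tokenizer
        self.label_encoder = label_encoder

    def predict(self, context, input_data):
        return self.model.predict(input_data)


pyfunc_model = KerasModel(model, fitted_tokenizer, fitted_label_encoder)
mlflow.pyfunc.log_model("custom_model", python_model=pyfunc_model)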

Anonymous
Not applicable

@Saeid Hedayati:

Yes, that's another way to pass the tokenizer and label encoder objects directly to the KerasModel class instead of passing their paths. I'm glad to hear that the solution worked for you! Let me know if you have any other questions.

Thank you @Suteja Kanuri for your support, really appreciated!

Kaniz
Community Manager

Hi @Saeid Hedayati, together we can build a thriving community of shared knowledge and insights. Come back and mark the best answers to contribute to our ongoing pursuit of excellence.
