We are trying to build an internal use case based on PySpark. Our data requires a lot of pre-processing, so we wrote custom Spark ML pipeline stages, since some of the transformations we need aren't available in the pyspark.ml module. These custom pre-processing stages extend the Transformer/Estimator, HasInputCol, HasOutputCol, MLWritable, and MLReadable classes, i.e.:
from pyspark.ml.pipeline import Transformer, Estimator
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import MLReadable, MLWritable
We were able to tune it using Hyperopt and train/evaluate on the full dataset. We also logged the fitted pipeline model with MLflow. However, when we tried to load the pipeline model for inference, it failed inside the custom stages' __init__() method. We do not understand why the constructor is called on load at all, given that the stage parameters were already fitted into the object during the training (fitting) phase.
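For context, this is roughly the logging and loading flow we use (pipeline, train_df, score_df, and the run ID are placeholders for our actual objects, not the exact code):

import mlflow
import mlflow.spark

# pipeline, train_df and score_df stand in for our real objects
with mlflow.start_run():
    pipeline_model = pipeline.fit(train_df)  # pipeline includes the custom stages
    mlflow.spark.log_model(pipeline_model, artifact_path="model")

# Later, in the inference job -- this is the call that fails:
loaded = mlflow.spark.load_model("runs:/<run_id>/model")
predictions = loaded.transform(score_df)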
Here's a minimal, simplified sketch of the kind of custom transformer that's having issues (class, column names, and the factor parameter are illustrative, and we use the DefaultParamsReadable/DefaultParamsWritable mixins here in place of our hand-rolled MLReadable/MLWritable code):
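# Simplified stand-in for one of our stages; only the structure matters here.
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F

class ScaleTransformer(Transformer, HasInputCol, HasOutputCol,
                       DefaultParamsReadable, DefaultParamsWritable):
    """Multiplies the input column by a fixed factor (illustrative example)."""

    def __init__(self, inputCol, outputCol, factor):
        # Required positional arguments: this works fine during fit/transform,
        # but loading the saved pipeline appears to re-instantiate the class
        # without any arguments, so it dies here with a TypeError.
        super().__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)
        self.factor = factor  # plain attribute, not a Param

    def _transform(self, df):
        return df.withColumn(self.getOutputCol(),
                             F.col(self.getInputCol()) * self.factor)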
Here's the screenshot of the error we are facing:
If anyone has worked on this kind of development, please help! It would be great if someone could share a working example of doing this.