We are trying to build an internal use case based on PySpark. Our data requires a lot of pre-processing, so we wrote custom Spark ML pipeline stages, since some of the transformations we need aren't available in the pyspark.ml module. These custom pre-processing stages extend the Transformer/Estimator, HasInputCol, HasOutputCol, MLWritable, and MLReadable classes, i.e.:
from pyspark.ml.pipeline import Transformer, Estimator
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import MLReadable, MLWritable
We were able to tune it using Hyperopt and train/evaluate on the full dataset. We also logged the fitted pipeline model with MLflow. However, when we tried to load the pipeline model for inference, it failed inside the custom stages' __init__() method. We do not understand why the constructor is called on load at all, given that the stage parameters were already fitted into the object during the training (fitting) phase.
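For context, this is roughly the logging and loading flow we use (pipeline, train_df, score_df, and the run ID are placeholders for our actual objects, not the exact code):

import mlflow
import mlflow.spark

# pipeline, train_df and score_df stand in for our real objects
with mlflow.start_run():
    pipeline_model = pipeline.fit(train_df)  # pipeline includes the custom stages
    mlflow.spark.log_model(pipeline_model, artifact_path="model")

# Later, in the inference job -- this is the call that fails:
loaded = mlflow.spark.load_model("runs:/<run_id>/model")
predictions = loaded.transform(score_df)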
Here's a minimal, simplified sketch of the kind of custom transformer that's having issues (class, column names, and the factor parameter are illustrative, and we use the DefaultParamsReadable/DefaultParamsWritable mixins here in place of our hand-rolled MLReadable/MLWritable code):
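# Simplified stand-in for one of our stages; only the structure matters here.
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F

class ScaleTransformer(Transformer, HasInputCol, HasOutputCol,
                       DefaultParamsReadable, DefaultParamsWritable):
    """Multiplies the input column by a fixed factor (illustrative example)."""

    def __init__(self, inputCol, outputCol, factor):
        # Required positional arguments: this works fine during fit/transform,
        # but loading the saved pipeline appears to re-instantiate the class
        # without any arguments, so it dies here with a TypeError.
        super().__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)
        self.factor = factor  # plain attribute, not a Param

    def _transform(self, df):
        return df.withColumn(self.getOutputCol(),
                             F.col(self.getInputCol()) * self.factor)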
Here's the screenshot of the error we are facing:
If anyone has worked on this kind of development, please help! It would be great if someone could share a working example of doing this.