PySpark custom Transformer class - AttributeError: 'DummyMod' object has no attribute 'MyTransformer'

simranisanewbie
New Contributor

I am trying to create a custom transformer as a stage in my pipeline. A few of the transformations are done via Spark NLP and the next few with MLlib. To pass the result of a Spark NLP stage on to the next MLlib stage, I need to extract the spark_nlp_col.result column, and I use a custom transformer stage for that.
After fitting my pipeline I am able to persist it, but when I load it back I get this error:

AttributeError: 'DummyMod' object has no attribute 'MyTransformer'

Here is my class:

 

from pyspark.ml import Transformer
from pyspark.ml.param.shared import Param, Params, TypeConverters
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

class MyTransformer(Transformer, DefaultParamsWritable, DefaultParamsReadable):
    inputCol = Param(Params._dummy(), "inputCol", "input column name", TypeConverters.toString)
    outputCol = Param(Params._dummy(), "outputCol", "output column name", TypeConverters.toString)

    def __init__(self, inputCol=None, outputCol=None):
        super(MyTransformer, self).__init__()
        self._setDefault(inputCol=None, outputCol=None)
        self._set(inputCol=inputCol, outputCol=outputCol)

    def getInputCol(self):
        return self.getOrDefault(self.inputCol)

    def setInputCol(self, inputCol):
        self._set(inputCol=inputCol)

    def getOutputCol(self):
        return self.getOrDefault(self.outputCol)

    def setOutputCol(self, outputCol):
        self._set(outputCol=outputCol)

    def _transform(self, dataset):
        # Pull the nested `result` field out of the upstream annotation column.
        in_col = self.getInputCol()
        out_col = self.getOutputCol()
        return dataset.withColumn(out_col, dataset[in_col + ".result"])
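For context, here is a minimal sanity check of the transformer on a toy DataFrame. The nested token_in struct with a result field is a made-up stand-in for a real Spark NLP annotation column (which is an array of annotation structs, so there .result comes back as an array of strings):

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for a Spark NLP annotation column: a struct with a `result` field.
df = spark.createDataFrame([
    Row(token_in=Row(result="hello")),
    Row(token_in=Row(result="world")),
])

t = MyTransformer(inputCol="token_in", outputCol="token_out")
t.transform(df).show()

The transform itself works fine; the problem only shows up after saving and loading.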

 

I have created a simple wrapper function over it for standardisation, then used it to build the pipeline, fit it, and save it:

 

from pyspark.ml import Pipeline

def extract_col(cols, in_suffix, out_suffix):
    return [MyTransformer(inputCol=col + in_suffix, outputCol=col + out_suffix) for col in cols]

'''
stages before custom transformer
'''
extractors = extract_col(cols, "_in", "_out")
'''
stages after custom transformer
'''

stages = s1 + s2 + .. + extractors + .. + sn-1 + sn
pipeline = Pipeline(stages=stages)
fit_pipeline = pipeline.fit(data)
fit_pipeline.write().overwrite().save(path_to_store_at)
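Side note: since the stage mixes in DefaultParamsWritable/DefaultParamsReadable, a single instance can also be round-tripped on its own, independent of the pipeline (the path below is just a placeholder):

# Minimal single-stage round trip: DefaultParamsWritable provides
# write()/save(), DefaultParamsReadable provides the load() classmethod.
t = MyTransformer(inputCol="token_in", outputCol="token_out")
t.write().overwrite().save("/tmp/my_transformer_test")
t2 = MyTransformer.load("/tmp/my_transformer_test")
print(t2.getInputCol(), t2.getOutputCol())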

 

How I am reading it back:

 

from pyspark.ml import PipelineModel

saved_pipeline = PipelineModel.load("path_where_stored")

 

That is when I encounter the error above.
I have tried multiple ways of writing the custom class, using HasInputCol, HasOutputCol, etc., but nothing has worked so far. Any idea how I can resolve this?
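For completeness, the HasInputCol/HasOutputCol variant I tried looked roughly like this (a sketch from memory; it fails with the same AttributeError on load):

from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

# Variant using the shared-param mixins instead of hand-declared Params.
class MyTransformer(Transformer, HasInputCol, HasOutputCol,
                    DefaultParamsWritable, DefaultParamsReadable):
    def __init__(self, inputCol=None, outputCol=None):
        super(MyTransformer, self).__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _transform(self, dataset):
        return dataset.withColumn(self.getOutputCol(),
                                  dataset[self.getInputCol() + ".result"])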

0 REPLIES
