Knowledge Sharing Hub

Log Custom Transformer with Feature Engineering Client

WarrenO
New Contributor III

Hi everyone,

I'm building a PySpark ML Pipeline whose first stage fills nulls with zero. I wrote a custom class for this because I could not find a built-in Transformer that does this imputation.

I can log this pipeline with MLflow's log_model method and load it for scoring, but when I log it with the Feature Engineering package, the score_batch method throws an error saying that the custom class does not exist. I need to log it via the Feature Engineering package so I can properly leverage feature stores and lineage in Unity Catalog. Is anyone able to help? The sample pipeline code is below. Inputs are loaded using feature lookups and the create_training_set method.

All assistance is appreciated!

 

import mlflow
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.ml.regression import RandomForestRegressor

# Custom pipeline stage that replaces nulls in the selected columns.
class FillNA(Transformer, DefaultParamsReadable, DefaultParamsWritable, mlflow.pyfunc.PythonModel):
    def __init__(self, fill_value=0, inputCols=None):
        super(FillNA, self).__init__()
        self.fill_value = fill_value
        self.inputCols = inputCols

    def _transform(self, df):
        # Fill nulls only in the columns listed in inputCols.
        return df.fillna(self.fill_value, subset=self.inputCols)

    def predict(self, context, model_input, params=None):
        return self._transform(model_input)

# cols_to_fill is the list of column names to impute (defined elsewhere).
fill_na = FillNA(fill_value=0.0, inputCols=cols_to_fill)
rfr = RandomForestRegressor()
pipeline = Pipeline(stages=[fill_na, rfr])

Accepted Solutions

koji_kawamura
Databricks Employee

Hi @WarrenO, thanks for sharing this along with the detailed code!

I was able to reproduce the issue, specifically the following error:

AttributeError: module '__main__' has no attribute 'CustomAdder'
File <command-1315887242804075>, line 39
35 evaluator = RegressionEvaluator(
36 labelCol="alcohol", predictionCol="prediction")
38 # Log metrics
---> 39 rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
41 # Log model metrics
42 mlflow.log_metric("root_mean_squared_error", rmse)

I did some research internally and found that a similar issue has already been reported. Unfortunately, it is confirmed that custom classes are not currently supported with Feature Store score_batch. The reason is that FeatureEngineeringClient's score_batch executes the transform using remote UDFs, and the workers cannot load the custom class definitions there. There is also no way to manually specify additional dependencies with FeatureEngineeringClient's log_model. We would need something like the PyFunc flavor's additional code_path parameter, but it is not available here.
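To illustrate the mechanism, here is a minimal sketch in plain Python (no Spark or Databricks involved). It assumes that the serialized model refers to the custom class by module and class name, which is how Python's standard pickle handles classes. The module name driver_notebook and the toy CustomAdder class are hypothetical stand-ins for the notebook's __main__ and a custom pipeline stage; the exact error text on a cluster may differ.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the module where the notebook defines its classes.
driver_module = types.ModuleType("driver_notebook")
sys.modules["driver_notebook"] = driver_module


class CustomAdder:
    """Toy stand-in for a custom pipeline stage; not a real Spark Transformer."""

    def __init__(self, amount):
        self.amount = amount


# Make the class look as if it had been defined in the driver's module.
CustomAdder.__module__ = "driver_notebook"
CustomAdder.__qualname__ = "CustomAdder"
driver_module.CustomAdder = CustomAdder

# Pickle stores a reference to the class (module name + class name) plus the
# instance state. The class body itself is not part of the payload.
payload = pickle.dumps(CustomAdder(amount=1))

# Simulate a worker: the module exists there, but the class definition does not.
del driver_module.CustomAdder


def load_on_worker(data):
    """Try to deserialize; return the exception instead of raising it."""
    try:
        return pickle.loads(data)
    except AttributeError as exc:
        return exc


result = load_on_worker(payload)
print(type(result).__name__, result)

# Once the definition is importable again, the same payload loads fine.
driver_module.CustomAdder = CustomAdder
restored = pickle.loads(payload)
print(restored.amount)
```

This only shows why a missing class definition on the loading side produces an AttributeError; it is not a workaround for score_batch.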

I will let the product team know that there is demand for this feature in order to implement end-to-end feature management. I hope it makes a difference. Thanks again for reporting!
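Since the limitation concerns custom classes, one direction that might be worth trying is to express the null fill with Spark's built-in SQLTransformer, so that the pipeline contains no user-defined class. The sketch below is an assumption, not something confirmed in this thread: it has not been tested with score_batch, and the helper names build_fill_statement and make_fill_stage and the _filled suffix are made up for illustration. Note that coalesce replaces SQL NULLs only.

```python
def build_fill_statement(cols, fill_value=0.0, suffix="_filled"):
    """Build a SQLTransformer statement that adds null-filled copies of cols.

    Hypothetical helper: each column `c` gets a companion column `c_filled`
    in which NULL is replaced by fill_value.
    """
    if not cols:
        raise ValueError("cols must contain at least one column name")
    filled = ", ".join(
        f"coalesce(`{c}`, {fill_value!r}) AS `{c}{suffix}`" for c in cols
    )
    # __THIS__ is the placeholder SQLTransformer uses for the input DataFrame.
    return f"SELECT *, {filled} FROM __THIS__"


def make_fill_stage(cols, fill_value=0.0):
    """Wrap the statement in a built-in SQLTransformer pipeline stage."""
    # Imported lazily so the statement builder can be tried without Spark.
    from pyspark.ml.feature import SQLTransformer

    return SQLTransformer(statement=build_fill_statement(cols, fill_value))


statement = build_fill_statement(["a", "b"])
print(statement)
```

If this approach fits, make_fill_stage(cols_to_fill) would take the place of the FillNA stage in the Pipeline, and downstream stages would need to read the _filled columns rather than the originals.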
