Databricks Community

WarrenO · ‎03-06-2025

Hi everyone,

I'm building a PySpark ML Pipeline where the first stage fills nulls with zero. I wrote a custom class to do this since I cannot find a built-in Transformer that performs this imputation.

I can log this pipeline with MLflow's log_model method and load it for scoring, but when I log it with the Feature Engineering package, the score_batch method throws an error saying that the custom class does not exist. I need to log it via the Feature Engineering package so I can properly leverage feature stores and lineage in Unity Catalog. Is anyone able to help? The sample pipeline code is below. Inputs are loaded using feature lookups and the create_training_set method.

All assistance is appreciated!

import mlflow
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.ml.regression import RandomForestRegressor

# Custom stage that replaces nulls in the given columns with a constant.
class FillNA(Transformer, DefaultParamsReadable, DefaultParamsWritable, mlflow.pyfunc.PythonModel):
    def __init__(self, fill_value=0, inputCols=None):
        super(FillNA, self).__init__()
        self.fill_value = fill_value
        self.inputCols = inputCols

    def _transform(self, df):
        return df.fillna(self.fill_value, subset=self.inputCols)

    def predict(self, context, model_input, params=None):
        return self._transform(model_input)

# cols_to_fill is the list of column names to impute (defined elsewhere).
fill_na = FillNA(fill_value=0.0, inputCols=cols_to_fill)
rfr = RandomForestRegressor()
pipeline = Pipeline(stages=[fill_na, rfr])

koji_kawamura · ‎03-07-2025

Hi @WarrenO, thanks for sharing the detailed code!

I was able to reproduce the error. Specifically, I got the following:

AttributeError: module '__main__' has no attribute 'CustomAdder'
File <command-1315887242804075>, line 39
35 evaluator = RegressionEvaluator(
36 labelCol="alcohol", predictionCol="prediction")
38 # Log metrics
---> 39 rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
41 # Log model metrics
42 mlflow.log_metric("root_mean_squared_error", rmse)

I did some research internally and found that a similar issue has been reported. Unfortunately, it is confirmed that custom classes are not currently supported with feature store score_batch. The reason is that FeatureEngineeringClient's score_batch executes the transform using remote UDFs, and the workers cannot load the custom class definitions there. There is also no way to manually specify additional dependencies with FeatureEngineeringClient's log_model. We would need something like the PyFunc flavor's additional code_path parameter, but it's not available here.
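The error message is consistent with how Python's pickle module handles classes: it stores only a reference to the module and class name, not the class body, so the loading process must already have that class defined. The sketch below illustrates that general mechanism in plain Python. It is not a reproduction of what score_batch does internally, and the fake_notebook module is a hypothetical stand-in for the notebook's __main__ module.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the notebook's __main__ module on the driver.
notebook = types.ModuleType("fake_notebook")
sys.modules["fake_notebook"] = notebook


class CustomAdder:
    """Toy custom class, named after the one in the traceback."""

    def __init__(self, amount):
        self.amount = amount


# Register the class as if it had been defined inside the notebook module.
CustomAdder.__module__ = "fake_notebook"
CustomAdder.__qualname__ = "CustomAdder"
notebook.CustomAdder = CustomAdder

# Pickling records "fake_notebook.CustomAdder" by name, not the class source.
payload = pickle.dumps(CustomAdder(3))

# Simulate a worker whose module does not contain the class definition.
del notebook.CustomAdder

try:
    pickle.loads(payload)
    error_message = None
except AttributeError as exc:
    # The worker cannot resolve the class by name, so loading fails.
    error_message = str(exc)

print(error_message)
```

Restoring the attribute with notebook.CustomAdder = CustomAdder before calling pickle.loads makes the load succeed, which is the role a code_path-style option plays by shipping the definition along with the model.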

I will let the product team know that this feature is needed to implement end-to-end feature management. I hope it makes a difference. Thanks again for reporting!
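Until custom classes are supported, one possible workaround is to express the null fill with a built-in stage, so that no custom class needs to be loaded on the workers. The sketch below uses pyspark.ml.feature.SQLTransformer with COALESCE. This is an assumption rather than something confirmed in this thread: it has not been verified against score_batch, and the helper names build_fill_statement and make_fill_stage and the "_filled" suffix are hypothetical.

```python
def build_fill_statement(cols, fill_value=0.0, suffix="_filled"):
    """Build a SQLTransformer statement that adds null-filled copies of cols.

    Hypothetical helper: each column `c` gets a new column `c<suffix>` in
    which nulls are replaced by fill_value via COALESCE.
    """
    if not cols:
        raise ValueError("cols must contain at least one column name")
    literal = repr(float(fill_value))
    filled = ", ".join(
        f"COALESCE(`{c}`, {literal}) AS `{c}{suffix}`" for c in cols
    )
    # __THIS__ is SQLTransformer's placeholder for the input DataFrame.
    return f"SELECT *, {filled} FROM __THIS__"


def make_fill_stage(cols, fill_value=0.0):
    """Wrap the statement in a built-in SQLTransformer pipeline stage."""
    # Imported lazily so the statement builder can be used without Spark.
    from pyspark.ml.feature import SQLTransformer

    return SQLTransformer(statement=build_fill_statement(cols, fill_value))


statement = build_fill_statement(["a", "b"])
print(statement)
```

In a pipeline, make_fill_stage(cols_to_fill) would take the place of the FillNA stage, and downstream stages would read the "_filled" columns instead of the originals.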
