log signature and input data for Spark LinearRegression

MohsenJ — Thu, 15 Feb 2024 16:48:03 GMT

I am looking for a way to log my `pyspark.ml.regression.LinearRegression` model with an input example and signature data. The usual examples I have found use sklearn, where you can simply do

    # Log the model with signature and input example
    signature = infer_signature(X_train, pd.DataFrame(y_train))
    input_example = X_train.head(3)
    mlflow.sklearn.log_model(rf_model, "rf_model", signature=signature, input_example=input_example)

but it doesn't work for `LinearRegression`, because the "features" column must be a Spark `VectorUDT` and that type is not supported by MLflow. This is how I generate my feature column:

    cat_input_cols = ["gender", "occupation", "zip_code", "age_category"]
    cat_index_output_cols = [x + '_index' for x in cat_input_cols]
    ohe_output_cols = [x + '_ohe' for x in cat_input_cols]
    stringIndexer = StringIndexer(inputCols=cat_input_cols, outputCols=cat_index_output_cols, handleInvalid="error", stringOrderType="alphabetDesc")
    ohe_encoder = OneHotEncoder(inputCols=cat_index_output_cols, outputCols=ohe_output_cols)
    assembler = VectorAssembler(inputCols=features, outputCol="features")
    pipeline = Pipeline(stages=[stringIndexer, ohe_encoder, assembler])
    df_training_transformed = pipeline.fit(df_trainig_set).transform(df_trainig_set)

Now, when I build my model, I would like to use `infer_signature`, but I can't find the right way to do it.

    with mlflow.start_run(run_name="content_based_linReg") as run:
        # ---- set hyperparameters
        lr = LinearRegression(featuresCol=COL_FEATURES, labelCol=COL_LABEL)
        lr.setMaxIter(MAX_ITER)
        lr.setRegParam(REG_PARAM)
        lr.setElasticNetParam(ELASTIC_NET_PARAM)
        lr.setFitIntercept(FIT_INTERCEPT)
        # ---- Split data into training and validation sets
        (df_training_sampled, df_validation_sampled) = df_content.randomSplit([0.6, 0.4], SEEDS)
        # Create the model.
        lr_model = lr.fit(df_training_sampled)
        # Log the model parameters used for this run.
        mlflow.log_param("MAX_ITER", MAX_ITER)
        mlflow.log_param("REG_PARAM", REG_PARAM)
        mlflow.log_param("ELASTIC_NET_PARAM", ELASTIC_NET_PARAM)
        mlflow.log_param("FIT_INTERCEPT", FIT_INTERCEPT)
        # Run the model to create a prediction. Predict against the validation_df.
        df_validation_predictions = lr_model.transform(df_validation_sampled)
        # -->> this won't work
        # Log the model with signature and input example
        signature = infer_signature(df_training_sampled["features"], df_validation_predictions["prediction"])
        input_example = df_training_sampled.select(col("features"), col("rating")).head(3)
        mlflow.spark.log_model(lr_model, "LinearRegression", sample_input=input_example, signature=signature)

Here is the error I get when I call `infer_signature`:

Exception: Unsupported Spark Type '<class 'pyspark.ml.linalg.VectorUDT'>', MLflow schema is only supported for scalar Spark types.

Any idea how I should go about it?

Re: log signature and input data for Spark LinearRegression

MohsenJ — Fri, 16 Feb 2024 10:43:53 GMT

Thanks @Retired_mod.

Three clarification questions:

1. Wouldn't this cause issues when I load the model for inference, since the model signature would then differ from the actual input?

2. I also need to log the signature of the prediction output. Should I just do

    df_predictions = lr_model.transform(df_validation_sampled)
    signature = ModelSignature(inputs=Schema(input_columns), outputs=Schema(df_predictions[["prediction"]]))

3. To pass the input sample, should I also just pass rows of my dataset from before the transformation?

Re: log signature and input data for Spark LinearRegression

MohsenJ — Fri, 16 Feb 2024 13:09:08 GMT

And one more question: you use `mlflow.sklearn` instead of `mlflow.spark`. Why is that?

Re: log signature and input data for Spark LinearRegression

javierbg — Tue, 14 May 2024 07:09:27 GMT

Hey, my team is facing the same problem.

Given that MLflow usage is part of the "Scalable Machine Learning with Apache Spark" course I got the impression that Spark ML models would work well with MLflow and Databricks model logging, but it's clearly not the case. I went back to the course material and, sure enough, when explaining the model registry it switches to using sklearn models. I gotta say, I feel a bit betrayed...

The solution given by @Retired_mod partially works but it is a hack and, as @MohsenJ said, it doesn't really cover the signature of the output. Is there a proper/more complete solution?

Re: log signature and input data for Spark LinearRegression

Abi105 — Tue, 21 May 2024 14:54:02 GMT

Hi @Retired_mod, @MohsenJ, did we get this working? I am facing a similar issue when trying to migrate one of our registered models to Unity Catalog. MLflow's `infer_signature` doesn't seem to work because the input Spark DataFrame contains features from `VectorAssembler`, which are recognised as type `VectorUDT()`.

mlflow.pyspark.ml: Model inputs contain unsupported Spark data types: [StructField('prd_dt', DateType(), False), StructField('features', VectorUDT(), True), StructField('features_scaled', VectorUDT(), True)]. Model signature is not logged.

mlflow.data.spark_dataset: Failed to infer schema for Spark dataset. Exception: Unsupported Spark Type '<class 'pyspark.ml.linalg.VectorUDT'>' for MLflow schema.

Have we found a proper solution for MLflow to support data types other than scalars?

Would appreciate any inputs.

Re: log signature and input data for Spark LinearRegression

javierbg — Tue, 21 May 2024 14:56:16 GMT

@Abi105 I wasn't able to make it work, sorry

Re: log signature and input data for Spark LinearRegression

MohsenJ — Wed, 22 May 2024 12:28:31 GMT

Me neither. But the MLflow documentation suggests that newer versions of MLflow should be able to handle Array and Object (dict) types. Maybe that could help? I haven't tried it myself.

Support for Array and Object types was introduced in MLflow version 2.10.0. These types will not be recognized in previous versions of MLflow. If you are saving a model that uses these signature types, you should ensure that any other environment that attempts to load these models has a version of MLflow installed that is at least 2.10.0.
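The version requirement in that note can be checked before loading a model. Below is a minimal sketch using only the standard library; the helper names (`parse_release`, `meets_minimum`, `installed_mlflow_version`) are made up for illustration and are not part of MLflow.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(v: str) -> tuple:
    """Turn '2.10.0' or '2.15.0rc1' into a tuple of three leading integers."""
    parts = []
    for piece in v.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically, so that 2.9.2 sorts below 2.10.0."""
    return parse_release(installed) >= parse_release(minimum)


def installed_mlflow_version():
    """Return the installed MLflow version string, or None if it is absent."""
    try:
        return version("mlflow")
    except PackageNotFoundError:
        return None


found = installed_mlflow_version()
if found is None:
    print("MLflow is not installed in this environment")
else:
    print(found, "supports Array/Object types:", meets_minimum(found, "2.10.0"))
```

The comparison is done on integer tuples because a plain string comparison would rank "2.9.2" above "2.10.0".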

Re: log signature and input data for Spark LinearRegression

ac10 — Tue, 17 Sep 2024 17:54:08 GMT

@MohsenJ @javierbg @Abi105 I have found a solution to this issue as I was trying to deploy Spark ML Models to Unity Catalog. Please view my blog and let me know if it helps solve your issues! https://medium.com/p/7d04e8539540

Re: log signature and input data for Spark LinearRegression

LuluLiu — Wed, 29 Jan 2025 20:31:34 GMT

I stumbled upon this thread while researching a similar issue. Note that MLflow supports `VectorUDT` starting from version 2.15.0. https://mlflow.org/releases/2.15.0