<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: log signature and input data for Spark LinearRegression in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/70266#M9681</link>
    <description>&lt;P&gt;Me neither. But the &lt;A href="https://mlflow.org/docs/latest/model/signatures.html" target="_self"&gt;MLflow documentation&lt;/A&gt; suggests that newer versions of MLflow can handle Array and Object (dict) types.&amp;nbsp;&lt;SPAN&gt;Maybe that could help? I haven't tried it myself.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Support for Array and Object types was introduced in MLflow version 2.10.0. These types will not be recognized in previous versions of MLflow. If you are saving a model that uses these signature types, you should ensure that any other environment that attempts to load these models has a version of MLflow installed that is at least 2.10.0.&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Wed, 22 May 2024 12:28:31 GMT</pubDate>
    <dc:creator>MohsenJ</dc:creator>
    <dc:date>2024-05-22T12:28:31Z</dc:date>
    <item>
      <title>log signature and input data for Spark LinearRegression</title>
      <link>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/60326#M9674</link>
      <description>&lt;P&gt;I am looking for a way to log my `pyspark.ml.regression.LinearRegression` model with a signature and input data. The usual examples I have found use sklearn, where you can simply do:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Log the model with signature and input example
signature = infer_signature(X_train, pd.DataFrame(y_train))
input_example = X_train.head(3)
mlflow.sklearn.log_model(rf_model, "rf_model", signature=signature, input_example=input_example)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;But this doesn't work for `LinearRegression`, because the "features" column must be a Spark `VectorUDT`, which is not supported by MLflow. This is how I generate my features column:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

cat_input_cols = ["gender", "occupation", "zip_code", "age_category"]
cat_index_output_cols = [x + '_index' for x in cat_input_cols]
ohe_output_cols = [x + '_ohe' for x in cat_input_cols]
stringIndexer = StringIndexer(inputCols=cat_input_cols, outputCols=cat_index_output_cols, handleInvalid="error", stringOrderType="alphabetDesc")
ohe_encoder = OneHotEncoder(inputCols=cat_index_output_cols, outputCols=ohe_output_cols)
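# `features` (the list of input column names) is not defined in this snippet; it is assumed to be defined earlier.
assembler = VectorAssembler(inputCols=features, outputCol="features")

pipeline = Pipeline(stages=[stringIndexer, ohe_encoder, assembler])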
df_training_transformed = pipeline.fit(df_trainig_set).transform(df_trainig_set)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now, when I build my model, I would like to use `infer_signature`, but I can't find the right way to do it.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;import mlflow
from mlflow.models.signature import infer_signature
from pyspark.ml.regression import LinearRegression
from pyspark.sql.functions import col

with mlflow.start_run(run_name="content_based_linReg") as run:
    # ---- Set hyperparameters
    lr = LinearRegression(featuresCol=COL_FEATURES, labelCol=COL_LABEL)
    lr.setMaxIter(MAX_ITER)
    lr.setRegParam(REG_PARAM)
    lr.setElasticNetParam(ELASTIC_NET_PARAM)
    lr.setFitIntercept(FIT_INTERCEPT)
   
    # ---- Split data into training and validation sets
    df_training_sampled, df_validation_sampled = df_content.randomSplit([0.6, 0.4], SEEDS)
    # Create the model.
    lr_model = lr.fit(df_training_sampled)
    
    # Log the model parameters used for this run.
    mlflow.log_param("MAX_ITER", MAX_ITER)
    mlflow.log_param("REG_PARAM", REG_PARAM)
    mlflow.log_param("ELASTIC_NET_PARAM", ELASTIC_NET_PARAM)
    mlflow.log_param("FIT_INTERCEPT", FIT_INTERCEPT)

    # Run the model to create a prediction. Predict against the validation_df.
    df_validation_predictions = lr_model.transform(df_validation_sampled)

    # --&amp;gt;&amp;gt; this won't work
    # Log the model with signature and input example
    signature = infer_signature(df_training_sampled["features"], df_validation_predictions["prediction"])
    input_example = df_training_sampled.select(col("features"), col("rating")).head(3)

    mlflow.spark.log_model(lr_model, "LinearRegression", sample_input=input_example, signature=signature)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here is the error I get when I call `infer_signature`:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Exception: Unsupported Spark Type '&amp;lt;class 'pyspark.ml.linalg.VectorUDT'&amp;gt;', MLflow schema is only supported for scalar Spark types.&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any idea how I should go about it?&lt;/P&gt;</description>
      <pubDate>Thu, 15 Feb 2024 16:48:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/60326#M9674</guid>
      <dc:creator>MohsenJ</dc:creator>
      <dc:date>2024-02-15T16:48:03Z</dc:date>
    </item>
    <item>
      <title>Re: log signature and input data for Spark LinearRegression</title>
      <link>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/60393#M9676</link>
      <description>&lt;P&gt;Thanks &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Three clarification questions:&lt;/P&gt;&lt;P&gt;1.&amp;nbsp; Wouldn't this cause issues when I load the model for inference, since in this case the model signature differs from the actual input?&lt;/P&gt;&lt;P&gt;2. I also need to log the signature of the prediction output. Should I just do the following?&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df_predictions = lr_model.transform(df_validation_sampled)
signature = ModelSignature(inputs=Schema(input_columns), outputs=Schema(df_predictions[["prediction"]]))&lt;/LI-CODE&gt;&lt;P&gt;3. To pass the input example, should I also just pass rows of my dataset from before the transformation?&lt;/P&gt;</description>
      <pubDate>Fri, 16 Feb 2024 10:43:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/60393#M9676</guid>
      <dc:creator>MohsenJ</dc:creator>
      <dc:date>2024-02-16T10:43:53Z</dc:date>
    </item>
    <item>
      <title>Re: log signature and input data for Spark LinearRegression</title>
      <link>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/60413#M9677</link>
      <description>&lt;P&gt;And one more question: you use mlflow.sklearn instead of mlflow.spark. Why is that?&lt;/P&gt;</description>
      <pubDate>Fri, 16 Feb 2024 13:09:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/60413#M9677</guid>
      <dc:creator>MohsenJ</dc:creator>
      <dc:date>2024-02-16T13:09:08Z</dc:date>
    </item>
    <item>
      <title>Re: log signature and input data for Spark LinearRegression</title>
      <link>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/68958#M9678</link>
      <description>&lt;P&gt;Hey, my team is facing the same problem.&lt;/P&gt;&lt;P&gt;Given that MLflow usage is part of the "Scalable Machine Learning with Apache Spark" course, I got the impression that Spark ML models would work well with MLflow and Databricks model logging, but that's clearly not the case. I went back to the course material and, sure enough, when explaining the model registry it switches to using sklearn models. I gotta say, I feel a bit betrayed...&lt;/P&gt;&lt;P&gt;The solution given by&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;partially works, but it is a hack and, as&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/97748"&gt;@MohsenJ&lt;/a&gt;&amp;nbsp;said, it doesn't really cover the signature of the output. Is there a proper, more complete solution?&lt;/P&gt;</description>
      <pubDate>Tue, 14 May 2024 07:09:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/68958#M9678</guid>
      <dc:creator>javierbg</dc:creator>
      <dc:date>2024-05-14T07:09:27Z</dc:date>
    </item>
    <item>
      <title>Re: log signature and input data for Spark LinearRegression</title>
      <link>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/70140#M9679</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/97748"&gt;@MohsenJ&lt;/a&gt;, did we get this working? I am facing a similar issue when trying to migrate one of the registered models to Unity Catalog. MLflow's infer_signature doesn't seem to work, because the input Spark DataFrame contains features from VectorAssembler that are recognised as type&amp;nbsp;&lt;SPAN&gt;VectorUDT().&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;mlflow.pyspark.ml: Model inputs contain unsupported Spark data types: [StructField('prd_dt', DateType(), False), StructField('features', VectorUDT(), True), StructField('features_scaled', VectorUDT(), True)]. Model signature is not logged.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;mlflow.data.spark_dataset: Failed to infer schema for Spark dataset. Exception: Unsupported Spark Type '&amp;lt;class 'pyspark.ml.linalg.VectorUDT'&amp;gt;' for MLflow schema.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Have we found a proper solution for MLflow to support data types other than scalars?&lt;/P&gt;&lt;P&gt;I would appreciate any input.&lt;/P&gt;</description>
      <pubDate>Tue, 21 May 2024 14:54:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/70140#M9679</guid>
      <dc:creator>Abi105</dc:creator>
      <dc:date>2024-05-21T14:54:02Z</dc:date>
    </item>
    <item>
      <title>Re: log signature and input data for Spark LinearRegression</title>
      <link>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/70141#M9680</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/105565"&gt;@Abi105&lt;/a&gt;&amp;nbsp;I wasn't able to make it work, sorry&lt;/P&gt;</description>
      <pubDate>Tue, 21 May 2024 14:56:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/70141#M9680</guid>
      <dc:creator>javierbg</dc:creator>
      <dc:date>2024-05-21T14:56:16Z</dc:date>
    </item>
    <item>
      <title>Re: log signature and input data for Spark LinearRegression</title>
      <link>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/70266#M9681</link>
      <description>&lt;P&gt;Me neither. But the &lt;A href="https://mlflow.org/docs/latest/model/signatures.html" target="_self"&gt;MLflow documentation&lt;/A&gt; suggests that newer versions of MLflow can handle Array and Object (dict) types.&amp;nbsp;&lt;SPAN&gt;Maybe that could help? I haven't tried it myself.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Support for Array and Object types was introduced in MLflow version 2.10.0. These types will not be recognized in previous versions of MLflow. If you are saving a model that uses these signature types, you should ensure that any other environment that attempts to load these models has a version of MLflow installed that is at least 2.10.0.&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 22 May 2024 12:28:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/70266#M9681</guid>
      <dc:creator>MohsenJ</dc:creator>
      <dc:date>2024-05-22T12:28:31Z</dc:date>
    </item>
    <item>
      <title>Re: log signature and input data for Spark LinearRegression</title>
      <link>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/90773#M9682</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/97748"&gt;@MohsenJ&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/102644"&gt;@javierbg&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/105565"&gt;@Abi105&lt;/a&gt;&amp;nbsp;&amp;nbsp;I have found a solution to this issue as I was trying to deploy Spark ML Models to Unity Catalog. Please view my blog and let me know if it helps solve your issues! &lt;A href="https://medium.com/p/7d04e8539540" target="_blank" rel="noopener"&gt;https://medium.com/p/7d04e8539540&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 17 Sep 2024 17:54:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/90773#M9682</guid>
      <dc:creator>ac10</dc:creator>
      <dc:date>2024-09-17T17:54:08Z</dc:date>
    </item>
    <item>
      <title>Re: log signature and input data for Spark LinearRegression</title>
      <link>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/107663#M9683</link>
      <description>&lt;P&gt;I stumbled upon this thread while researching a similar issue. Note that MLflow supports VectorUDT starting from version 2.15.0.&amp;nbsp;&lt;A href="https://mlflow.org/releases/2.15.0" target="_blank"&gt;https://mlflow.org/releases/2.15.0&lt;/A&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Jan 2025 20:31:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/log-signature-and-input-data-for-spark-linearregression/m-p/107663#M9683</guid>
      <dc:creator>LuluLiu</dc:creator>
      <dc:date>2025-01-29T20:31:34Z</dc:date>
    </item>
  </channel>
</rss>

