Databricks Community

Miki · ‎06-04-2024

I am having an issue similar to the one in this post, "log signature and input data for Spark LinearRegression". I am using mlflow v2.13.0 and logging my model with mlflow.pyfunc.log_model. I am starting a new post here since there doesn't seem to be any follow-up from the community on that one.

While I am able to save a signature with array types, running inference on a model with the signature logged is more than 100x slower, which is not acceptable for my use case. Skipping the array columns when logging the signature, as the original answer in the linked post suggests, does not work: it throws a KeyError at inference time on the skipped columns. Without logging a signature, I cannot register the model in Unity Catalog. However, I know there must be a way around this, because Databricks' FeatureEngineeringClient log model function has no problem registering the model and running inference in a reasonable amount of time, and based on the logged model schema, it does seem to be skipping these array columns somehow. I cannot use the FeatureEngineeringClient's log model function myself because it doesn't allow me to pass in a custom loss function. Any advice here would be appreciated.

MohsenJ · ‎06-06-2024

@Miki can you please share your code for logging the signature with array types?

Miki · ‎06-13-2024

Sure, you can reference the commented-out code here, which builds the signature from a PySpark DataFrame with array types. This signature would then be passed in on L275 (also commented out). Please let me know if you get this working without making inference unbearably slow.