TypeError: ColSpec.__init__() got an unexpected keyword argument 'required'

Hi Team, one of my customers is facing the issue below. Has anyone seen this before? Any help would be appreciated.

import mlflow


catalog_name = "system"

embed = mlflow.pyfunc.spark_udf(spark, f"models:/", "array<float>")

Running the code above produces the following error and warnings:

TypeError: ColSpec.__init__() got an unexpected keyword argument 'required'

WARNING mlflow.pyfunc: Detected one or more mismatches between the model's dependencies and the current Python environment: - mlflow (current: 2.7.1, required: mlflow==2.11.2) - torch (current: 2.0.1+cu118, required: torch==2.2.1) - transformers (current: 4.31.0, required: transformers==4.38.2) To fix the mismatches, call `mlflow.pyfunc.get_model_dependencies(model_uri)` to fetch the model's environment and install dependencies using the resulting environment file.

WARNING mlflow.pyfunc: Calling `spark_udf()` with `env_manager="local"` does not recreate the same environment that was used during training, which may lead to errors or inaccurate predictions. We recommend specifying `env_manager="conda"`, which automatically recreates the environment that was used to train the model and performs inference in the recreated environment.


Hi @Sangeethagk, it looks like you’re encountering a couple of issues related to mlflow.pyfunc.spark_udf() and model dependencies.

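  1. TypeError: ColSpec.__init__() got an unexpected keyword argument 'required':

    • This error is most likely caused by the MLflow version mismatch reported in the first warning, not by the arguments passed to mlflow.pyfunc.spark_udf().
    • The model was logged with mlflow 2.11.2, and its saved signature includes a 'required' field for each column. The installed mlflow 2.7.1 does not recognize that field, so loading the signature fails.
    • To resolve this, upgrade mlflow on the cluster to the version the model requires (2.11.2) and restart the Python process before calling spark_udf() again.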
  2. Model Dependencies Mismatch:

    • The warning about model dependencies indicates that the current Python environment doesn’t match the environment in which the model was trained.
    • To fix this, you can use mlflow.pyfunc.get_model_dependencies(model_uri) to fetch the model’s environment and install the required dependencies using the resulting environment file.
    • Make sure your mlflow version matches the required version (2.11.2) and other dependencies are also aligned.
  3. Environment Manager for spark_udf():

    • The second warning suggests that using env_manager="local" with spark_udf() doesn’t recreate the same environment used during training.
    • To avoid errors or inaccurate predictions, consider specifying env_manager="conda". This will automatically recreate the training environment for inference.
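The points above can be sketched in code. This is a minimal sketch, not a confirmed fix: the helper names (find_mismatches, requirements_file, build_embed_udf) and the REQUIRED dictionary are made up for illustration, the pins are copied from the warning in the question, and model_uri is a placeholder for the full model URI, which is cut off in the original post.

```python
from importlib import metadata

# Pins copied from the mismatch warning in the question.
REQUIRED = {"mlflow": "2.11.2", "torch": "2.2.1", "transformers": "4.38.2"}


def find_mismatches(required, lookup=metadata.version):
    """Return {package: (installed, required)} for every pin that is not met.

    `lookup` is a hypothetical hook so the check can be exercised without
    the packages installed; by default it reads the current environment.
    """
    mismatches = {}
    for name, wanted in required.items():
        try:
            current = lookup(name)
        except metadata.PackageNotFoundError:
            current = None
        # Ignore a local build tag such as "+cu118" when comparing.
        base = current.split("+")[0] if current else None
        if base != wanted:
            mismatches[name] = (current, wanted)
    return mismatches


def requirements_file(model_uri):
    """Fetch the path to the model's pip requirements file."""
    import mlflow  # imported lazily so the version check works without mlflow

    return mlflow.pyfunc.get_model_dependencies(model_uri)


def build_embed_udf(spark, model_uri):
    """Create the UDF in a recreated environment instead of the local one."""
    import mlflow

    return mlflow.pyfunc.spark_udf(
        spark,
        model_uri,
        result_type="array<float>",
        env_manager="conda",  # assumption: conda is available on the cluster
    )
```

Calling find_mismatches(REQUIRED) on the cluster should list the same packages the warning reports. Once it returns an empty dictionary (after installing from the file returned by requirements_file and restarting Python), the ColSpec error should no longer appear, assuming the version mismatch is indeed the cause.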

Remember to address these points, and your issue should be resolved. If you need further assistance, feel free to ask! 😊

