Mlflow.evaluation fails to generate score

gendg — Fri, 30 Aug 2024 09:10:33 GMT

Code execution gets stuck as soon as evaluation of the data starts.

import mlflow
import pandas as pd

eval_df = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) "
            "lifecycle. It was developed by Databricks, a company that specializes in big data and "
            "machine learning solutions. MLflow is designed to address the challenges that data "
            "scientists and machine learning engineers face when developing, training, and deploying "
            "machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data "
            "processing and analytics. It was developed in response to limitations of the Hadoop "
            "MapReduce computing model, offering improvements in speed and ease of use. Spark "
            "provides libraries for various tasks such as data ingestion, processing, and analysis "
            "through its components like Spark SQL for structured data, Spark Streaming for "
            "real-time data processing, and MLlib for machine learning tasks",
        ],
    }
)

# llm_response is defined elsewhere (not shown in this post)
with mlflow.start_run(run_name="logging_model_as_openai_model", log_system_metrics=True) as run:
    mlflow.doctor()

    # log model as pyfunc
    logged_model = mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=llm_response,
        pip_requirements=["openai"],
        signature=None,
    )
    run_id = mlflow.active_run().info.run_id

    # load model using run id
    model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")

    results = mlflow.evaluate(
        model,
        eval_df,
        targets="ground_truth",  # specify which column corresponds to the expected output
        model_type="question-answering",  # model type indicates which metrics are relevant for this task
        evaluators="default",
    )

results.metrics

Re: Mlflow.evaluation fails to generate score

mark_ott — Fri, 07 Nov 2025 16:56:39 GMT

A Databricks notebook getting "stuck" when mlflow.evaluate starts is most likely caused by one of three things: the environment setup, the way the model is defined, or limitations in how mlflow.pyfunc.log_model and the evaluation utilities handle custom models such as OpenAI or other LLM-based text generators on the Databricks ML runtime.

Key Points to Address

1. MLflow Pyfunc Model Structure

mlflow.pyfunc.log_model expects the python_model argument to be an instance of a class that subclasses mlflow.pyfunc.PythonModel. If llm_response in your code is not structured that way, logging may fail, or evaluation may hang while trying to score each row.

2. `mlflow.evaluate` Support for Custom Text Models

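The mlflow.evaluate function was originally designed for standard ML models (e.g., classifiers, regressors). Support for the "question-answering" model type depends on the MLflow version, and as far as I know its default metrics rely on optional packages (evaluate, transformers, torch, textstat), some of which download assets on first use. If those packages are missing or the cluster cannot reach the network, evaluators="default" may not work out of the box for a custom text generation model, and custom metrics or a custom evaluator may be needed.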
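As a starting point for a custom metric, here is a minimal hand-rolled scoring sketch that does not depend on MLflow at all. The function names (normalize, exact_match_rate) and the normalization rule (lowercase, collapse whitespace) are my own assumptions for illustration, not MLflow's implementation, and the sample predictions are made up.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def exact_match_rate(predictions, targets):
    """Fraction of predictions equal to their target after normalization."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(t) for p, t in zip(predictions, targets))
    return hits / len(predictions)


# Made-up example data, only to show the call shape
predictions = ["MLflow is an open-source platform.", "Spark is a database."]
targets = [
    "mlflow is an  open-source platform.",
    "Apache Spark is a distributed computing system.",
]
score = exact_match_rate(predictions, targets)
print(score)
```

If a plain function like this produces sensible scores on your model's outputs, the model output format is probably fine and the problem is more likely in the evaluator setup.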

3. Model Loading and Inference

mlflow.pyfunc.load_model loads the model into the current Python environment. It does not install the packages listed in pip_requirements; it only compares them with what is installed and warns about mismatches. Missing dependencies, unresolved imports, or incompatibilities with the Databricks cluster can therefore cause errors, hangs, or timeouts during batch inference.
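A quick way to see what is actually installed on the cluster is to query package metadata directly. This is only a sketch using the standard library; the helper name installed_versions and the package list are assumptions, so adjust the list to whatever your model imports.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The package list is an assumption; use the ones your model needs.
report = installed_versions(["mlflow", "openai", "pandas"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Comparing this output with the requirements logged alongside the model should show whether a dependency mismatch is a plausible cause.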

Troubleshooting Steps

Check model implementation: Ensure that llm_response is an instance of a valid subclass of mlflow.pyfunc.PythonModel with a proper predict(self, context, model_input) method.
Review evaluator setup: The "question-answering" model type and default evaluator may not match your use case. You may need to specify or implement a custom evaluator compatible with your model's output.
Validate the MLflow runtime: Confirm that the Python environment (versions, packages) in Databricks matches the requirements for MLflow's OpenAI integration.
Log and test separately: Try running inference one row at a time (outside of MLflow evaluation), logging outputs manually to pinpoint where the code hangs.
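
The row-by-row check from the last step could look roughly like this. It is a sketch only: fake_llm_predict is a hypothetical stand-in for your real model call, and time_each_row is a helper name I made up. Replace the stand-in with whatever function actually calls the LLM.

```python
import time


def fake_llm_predict(question: str) -> str:
    """Hypothetical stand-in for the real model call; replace with your own."""
    return f"Answer to: {question}"


def time_each_row(predict_fn, inputs):
    """Call predict_fn once per input and record the output and elapsed seconds."""
    records = []
    for i, question in enumerate(inputs):
        start = time.perf_counter()
        answer = predict_fn(question)
        elapsed = time.perf_counter() - start
        print(f"row {i}: {elapsed:.3f}s")
        records.append({"input": question, "output": answer, "seconds": elapsed})
    return records


records = time_each_row(fake_llm_predict, ["What is MLflow?", "What is Spark?"])
```

If one row never prints its timing, the hang is in the model call for that input rather than in the evaluation step.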

Example Minimal Model Logging

python

import mlflow


class MyLLMModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input):
        # Placeholder: return one text answer per input row
        return [f"Answer {i + 1}" for i in range(len(model_input))]


# Log the model (pass an instance, not the class itself)
logged_model = mlflow.pyfunc.log_model(
    artifact_path="model",
    python_model=MyLLMModel(),
    pip_requirements=["openai"],
)

Test inference before running mlflow.evaluate:

python

model = mlflow.pyfunc.load_model(logged_model.model_uri)
# Pass a DataFrame with the input column, similar to what the evaluation step sends
print(model.predict(eval_df[["inputs"]]))  # Should produce output quickly

If manual prediction works but evaluation hangs, the issue is with the evaluation step.
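
To keep a notebook cell from blocking forever while you narrow this down, one option is to run the suspect call in a worker thread with a timeout. This is a generic standard-library sketch, not an MLflow feature; run_with_timeout and the two demo functions are names I made up. Note that a timed-out call is not cancelled and keeps running in the background.

```python
import concurrent.futures
import time


def run_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn in a worker thread; return (finished, result)."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args, **kwargs)
    try:
        return True, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return False, None
    finally:
        # Do not wait for a stuck worker; its thread keeps running in the background.
        executor.shutdown(wait=False)


def quick_call():
    """Demo function that returns immediately."""
    return 42


def slow_call():
    """Demo function that simulates a call taking too long."""
    time.sleep(0.3)
    return "done"


print(run_with_timeout(quick_call, 1.0))
print(run_with_timeout(slow_call, 0.05))
```

In practice you would pass a function that wraps your model.predict or mlflow.evaluate call, with a timeout long enough for a normal run.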

Documentation and Compatibility

Check the documentation that matches your exact versions of MLflow, Databricks Runtime, and Python, since evaluation behavior changes between releases. For recent changes or limitations, consult the Databricks MLflow docs, especially if you are using custom Python models with generative or non-standard outputs.

Bottom Line

The hang most likely comes from the evaluation step not being fully compatible with the custom text generation model, or from a setup that does not match the expected input/output structures. Manual row-by-row inference can show whether the problem lies in the evaluation step or in the model itself.
