
mlflow.evaluate fails to generate a score

gendg
New Contributor
The execution of the code gets stuck when the evaluation of the data starts.
 
eval_df = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) "
            "lifecycle. It was developed by Databricks, a company that specializes in big data and "
            "machine learning solutions. MLflow is designed to address the challenges that data "
            "scientists and machine learning engineers face when developing, training, and deploying "
            "machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data "
            "processing and analytics. It was developed in response to limitations of the Hadoop "
            "MapReduce computing model, offering improvements in speed and ease of use. Spark "
            "provides libraries for various tasks such as data ingestion, processing, and analysis "
            "through its components like Spark SQL for structured data, Spark Streaming for "
            "real-time data processing, and MLlib for machine learning tasks",
        ],
    }
)

with mlflow.start_run(run_name="logging_model_as_openai_model", log_system_metrics=True) as run:
    mlflow.doctor()
    # log model as pyfunc
    logged_model = mlflow.pyfunc.log_model(artifact_path="model", python_model=llm_response, pip_requirements=["openai"], signature=None)
    run_id = mlflow.active_run().info.run_id

 

    # load model using runid
    model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
 
    results = mlflow.evaluate(
        model,
        eval_df,
        targets="ground_truth",  # specify which column corresponds to the expected output
        model_type="question-answering",  # model type indicates which metrics are relevant for this task
        evaluators="default",
    )
results.metrics

mark_ott
Databricks Employee

The issue described, a Databricks notebook getting stuck during the evaluation phase of mlflow.evaluate, is most likely related to environment setup, model compatibility, or limitations of mlflow.pyfunc.log_model and the evaluation utilities when used with custom models such as OpenAI or other LLM-based text generators in the Databricks ML runtime.

Key Points to Address

1. MLflow Pyfunc Model Structure

MLflow expects a Python class implementing mlflow.pyfunc.PythonModel as the python_model argument. If llm_response in your code is not a subclass structured this way, logging or evaluation may fail or hang while trying to score each row.
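
If llm_response is currently a bare function, a minimal sketch of how it could be wrapped (the original implementation is not shown, so the OpenAI client call, the model name, and the "inputs" column are assumptions) looks like this:

import mlflow
from openai import OpenAI

class LLMResponseModel(mlflow.pyfunc.PythonModel):
    # Sketch: wraps an OpenAI chat-completion call in a PythonModel subclass.
    def predict(self, context, model_input):
        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
        answers = []
        for question in model_input["inputs"]:
            completion = client.chat.completions.create(
                model="gpt-3.5-turbo",  # placeholder model name
                messages=[{"role": "user", "content": question}],
            )
            answers.append(completion.choices[0].message.content)
        return answers

llm_response = LLMResponseModel()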

2. mlflow.evaluate Support for Custom Text Models

The mlflow.evaluate function was originally designed around standard ML models (e.g., classifiers, regressors). The "question-answering" model type and evaluators="default" may not work out-of-the-box for custom text generation models unless custom evaluators are implemented, and the default text metrics can pull in extra dependencies (and may download scoring models on first use), which can look like a hang.
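
One way to rule out the model-serving path is to evaluate a static table of pre-computed answers instead of the live model. A sketch, assuming a recent MLflow 2.x release (which accepts a predictions column name for static datasets) and the eval_df from the question:

# Compute the answers manually first, outside of mlflow.evaluate.
eval_df["predictions"] = model.predict(eval_df[["inputs"]])

# Then evaluate the static table; no model is invoked during evaluation.
static_results = mlflow.evaluate(
    data=eval_df,
    predictions="predictions",
    targets="ground_truth",
    model_type="question-answering",
    evaluators="default",
)
print(static_results.metrics)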

3. Model Loading and Inference

When loading a logged model via mlflow.pyfunc.load_model, MLflow will package and reload the Python environment specified in pip_requirements. Any issues with package dependencies, unresolved imports, or compatibility with the Databricks cluster could cause hangs or timeouts during batch inference.
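
A quick sanity check here is to print the versions in the notebook and pin them explicitly when logging, so the environment MLflow rebuilds matches the one you developed in. A sketch (the pins simply echo whatever is installed):

import sys

import mlflow
import openai

# Record the interpreter and package versions the model is logged with.
print(sys.version)
print("mlflow:", mlflow.__version__)
print("openai:", openai.__version__)

# Pin the same versions in the logged model's requirements.
logged_model = mlflow.pyfunc.log_model(
    artifact_path="model",
    python_model=llm_response,
    pip_requirements=[f"mlflow=={mlflow.__version__}", f"openai=={openai.__version__}"],
)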

Troubleshooting Steps

  • Check model implementation: Ensure that llm_response is a valid subclass of mlflow.pyfunc.PythonModel with a proper predict(self, context, model_input) method.

  • Review evaluator setup: The "question-answering" model type and default evaluator may not match your use case. You may need to specify or implement a custom evaluator compatible with your model's output.

  • Validate the MLflow runtime: Confirm that the Python environment (versions, packages) in Databricks matches the requirements for MLflow's OpenAI integration.

  • Log and test separately: Try running inference one row at a time (outside of MLflow evaluation), logging outputs manually to pinpoint where the code hangs; see the sketch right after this list.
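
A minimal sketch for that last point, reusing the loaded pyfunc model and eval_df from the question:

import time

import pandas as pd

# Call the model one question at a time and time each call, to separate a
# slow model from a hang inside mlflow.evaluate itself.
for question in eval_df["inputs"]:
    start = time.time()
    answer = model.predict(pd.DataFrame({"inputs": [question]}))
    print(f"{time.time() - start:.1f}s", question, "->", answer)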

Example Minimal Model Logging

class MyLLMModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input):
        # returns text answers based on input
        return ["Answer 1", "Answer 2"]

# Log model
logged_model = mlflow.pyfunc.log_model(
    artifact_path="model",
    python_model=MyLLMModel(),
    pip_requirements=["openai"],
)

Test inference before running mlflow.evaluate:

model = mlflow.pyfunc.load_model(logged_model.model_uri)
print(model.predict(eval_df["inputs"]))  # Should produce output quickly

If manual prediction works but evaluation hangs, the issue is with the evaluation step.
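
To narrow that down further, one option (again a sketch reusing the objects above) is to evaluate a single row and see whether the hang appears immediately:

# Run the same evaluation on one row to check whether the hang depends on the
# amount of data or starts as soon as the evaluator runs.
single_row = eval_df.head(1)
single_results = mlflow.evaluate(
    model,
    single_row,
    targets="ground_truth",
    model_type="question-answering",
    evaluators="default",
)
print(single_results.metrics)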

Documentation and Compatibility

Check the documentation for the specific versions of MLflow, Databricks Runtime, and Python you are running. For recent changes or limitations, consult the Databricks MLflow docs, especially when using custom Python models with generative or non-standard outputs.

Bottom Line

The code most likely hangs because MLflow's evaluation utility is not fully compatible with custom text generation models, or because the setup does not match the expected input/output structures. Manual row-by-row inference can clarify whether the issue lies in evaluation or in model serving.
