The issue described, a Databricks notebook appearing to hang during the evaluation phase with mlflow.evaluate, is most likely caused by environment setup, model compatibility, or limitations of mlflow.pyfunc.log_model and the evaluation utilities when used with custom models such as OpenAI or other LLM-based text generators on the Databricks ML runtime.
Key Points to Address
1. MLflow Pyfunc Model Structure
MLflow expects the python_model argument to be an instance of a class that subclasses mlflow.pyfunc.PythonModel. If llm_response in your code is not subclassed and structured that way, logging can fail, and evaluation can fail or hang while trying to score each row (see the minimal example further down).
2. mlflow.evaluate Support for Custom Text Models
The mlflow.evaluate function was originally designed for standard ML models (e.g., classifiers and regressors). The "question-answering" model type requires a relatively recent MLflow version (roughly 2.4 or later), and evaluators="default" can pull in additional metric dependencies (such as the evaluate, torch, and transformers packages used for toxicity and readability metrics); missing packages or slow metric-model downloads can look exactly like a hang. For custom text generation models you may also need to supply your own metrics, as sketched below.
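If you do need metrics tailored to your model's output, mlflow.evaluate accepts custom metrics via extra_metrics. The following is a minimal sketch, not your exact setup: it assumes MLflow 2.8+, that eval_df has "inputs" and "ground_truth" columns (illustrative names), and that logged_model is the object returned by mlflow.pyfunc.log_model (as in the example later in this answer):

```python
import mlflow
from mlflow.metrics import MetricValue, make_metric


# Hypothetical custom metric: fraction of predictions that exactly match the target.
def _exact_match_eval_fn(predictions, targets, metrics):
    scores = [
        int(str(p).strip().lower() == str(t).strip().lower())
        for p, t in zip(predictions, targets)
    ]
    return MetricValue(
        scores=scores,
        aggregate_results={"exact_match_rate": sum(scores) / len(scores)},
    )


exact_match = make_metric(
    eval_fn=_exact_match_eval_fn,
    greater_is_better=True,
    name="exact_match",
)

# eval_df is assumed to have an "inputs" column (questions) and a
# "ground_truth" column (reference answers); adjust to your schema.
results = mlflow.evaluate(
    model=logged_model.model_uri,   # URI returned by mlflow.pyfunc.log_model
    data=eval_df,
    targets="ground_truth",
    model_type="question-answering",
    evaluators="default",
    extra_metrics=[exact_match],
)
print(results.metrics)
```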
3. Model Loading and Inference
When a logged model is loaded via mlflow.pyfunc.load_model, MLflow loads it into the current Python environment; the packages listed in pip_requirements are recorded with the model and only compared against the active environment (with a warning on mismatch), not automatically installed. Missing or mismatched dependencies, unresolved imports, or incompatibilities with the Databricks cluster can therefore cause errors, hangs, or timeouts during batch inference.
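As a quick sanity check, you can compare the requirements recorded with the model against what is installed on the cluster. A minimal sketch, assuming logged_model is the object returned by mlflow.pyfunc.log_model (as in the example further down):

```python
import mlflow

# get_model_dependencies returns the path to the requirements.txt that was
# captured when the model was logged (pip format by default).
req_path = mlflow.pyfunc.get_model_dependencies(logged_model.model_uri)
with open(req_path) as f:
    print(f.read())

# In a Databricks notebook you can then install these into the notebook
# environment, e.g. %pip install -r <path printed above>, before evaluating.
```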
Troubleshooting Steps
- Check model implementation: Ensure that llm_response is a valid subclass of mlflow.pyfunc.PythonModel with a proper predict(self, context, model_input) method.
- Review evaluator setup: The "question-answering" model type and default evaluator may not match your use case. You may need to specify or implement a custom evaluator compatible with your model's output.
- Validate the MLflow runtime: Confirm that the Python environment (versions, packages) in Databricks matches the requirements for MLflow's OpenAI integration.
- Log and test separately: Run inference one row at a time (outside of MLflow evaluation) and log the outputs manually to pinpoint where the code hangs; a sketch follows this list.
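A minimal sketch of that last step, assuming eval_df is a pandas DataFrame with an "inputs" column (the column name is illustrative) and logged_model comes from mlflow.pyfunc.log_model:

```python
import time

import mlflow

model = mlflow.pyfunc.load_model(logged_model.model_uri)

# Score one row at a time and time each call to see where (or whether) it stalls.
for i in eval_df.index:
    start = time.time()
    output = model.predict(eval_df.loc[[i], ["inputs"]])
    print(f"row {i}: {time.time() - start:.1f}s -> {output}")
```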
Example Minimal Model Logging
import mlflow


class MyLLMModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input):
        # Return one text answer per input row (hard-coded here for brevity)
        return ["Answer 1", "Answer 2"]


# Log the model, pinning the packages it needs at inference time
logged_model = mlflow.pyfunc.log_model(
    artifact_path="model",
    python_model=MyLLMModel(),
    pip_requirements=["openai"],
)
Test inference before running mlflow.evaluate:
model = mlflow.pyfunc.load_model(logged_model.model_uri)
print(model.predict(eval_df[["inputs"]]))  # pass a DataFrame; this should return quickly
If manual prediction works but evaluation hangs, the issue is with the evaluation step.
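To confirm that, you can take model loading out of the picture entirely and point mlflow.evaluate at pre-computed outputs (evaluation of static datasets is supported in MLflow 2.8+). A hedged sketch, reusing the illustrative "inputs"/"ground_truth" column names:

```python
import mlflow

# Generate predictions once, outside of mlflow.evaluate.
eval_df["outputs"] = model.predict(eval_df[["inputs"]])

# Evaluate the static predictions; no model loading or serving is involved,
# so a hang here points at the evaluator/metrics rather than the model.
results = mlflow.evaluate(
    data=eval_df,
    predictions="outputs",
    targets="ground_truth",
    model_type="question-answering",
    evaluators="default",
)
print(results.metrics)
```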
Documentation and Compatibility
Check the reference documentation for the specific versions of MLflow, Databricks Runtime, and Python you are using. For recent changes or limitations, consult the Databricks MLflow docs, especially when logging custom Python models with generative or otherwise non-standard outputs.
Bottom Line
The code most likely hangs because MLflow's evaluation utility is not fully compatible with the custom text generation model, or because the inputs and outputs do not match the structure the evaluator expects. Manual row-by-row inference will clarify whether the problem lies in the evaluation step or in model inference itself.