Hello @drjb1010,
This is a known issue with DBR 14.3 where the `virtualenv` environment manager fails because it depends on `pyenv` to install specific Python versions, but `pyenv` is either not installed or not properly configured in the runtime environment.
## Understanding the Problem
The error occurs because, when you specify `env_manager="virtualenv"`, MLflow attempts to create an isolated Python environment that matches your model's training environment. It tries to use `pyenv` to install Python 3.9.19, but the command fails with exit code 2, which indicates one of the following:
- `pyenv` is not properly installed in DBR 14.3
- The Python version (3.9.19) cannot be installed via pyenv
- Required dependencies for building Python from source are missing
The transition away from `conda` as an environment manager has left `virtualenv` as the isolation option, but it relies on tooling (notably `pyenv`) that isn't fully satisfied in DBR 14.3.
## Recommended Solution
Use `env_manager="local"` instead of `env_manager="virtualenv"`:
```python
model_udf_score = mlflow.pyfunc.spark_udf(
    spark,
    model_version_uri,
    env_manager="local",  # Change from "virtualenv" to "local"
    params={"predict_method": "predict_score"},
)
```
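The way you apply the returned UDF doesn't change. A minimal sketch, assuming a DataFrame `df` whose columns match the model's input signature (both `df` and the output column name are placeholders):
```python
from pyspark.sql.functions import struct

# Apply the pyfunc UDF row-wise; the model receives the struct of input columns.
# `df` is a placeholder for your own feature DataFrame.
scored_df = df.withColumn("prediction_score", model_udf_score(struct(*df.columns)))
scored_df.show()
```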
## What This Means
When using `env_manager="local"`:
- The model will use the cluster's existing Python environment
- No isolated environment creation occurs
- Dependencies must already be installed on the cluster
- You lose the environment isolation benefit but gain stability
## Ensuring Dependencies Are Met
Since you're using the local environment, make sure your cluster has the required dependencies installed:
**Option 1: Install via notebook**
```python
%pip install -r /path/to/requirements.txt
```
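If you don't have the model's requirements file at hand, MLflow can fetch the one that was logged with the model; a minimal sketch using `mlflow.pyfunc.get_model_dependencies` (the URI is the same `model_version_uri` used above):
```python
import mlflow

# Download the requirements.txt that was logged alongside the model.
requirements_path = mlflow.pyfunc.get_model_dependencies(model_version_uri)

# Then install from it in a notebook cell:
# %pip install -r $requirements_path
```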
**Option 2: Cluster Libraries**
Install the required libraries directly on the cluster through the Databricks UI under cluster configuration.
**Option 3: Init Scripts**
Create an init script to install dependencies when the cluster starts.
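A rough sketch of generating such a script from a notebook; the volume path and the pinned packages are placeholders you would replace with your own location and the contents of the model's requirements file:
```python
# Write an init script that pre-installs the model's dependencies into the
# cluster's default Python environment. Path and package pins are placeholders.
init_script = """#!/bin/bash
/databricks/python/bin/pip install scikit-learn==1.3.0 pandas==1.5.3
"""

dbutils.fs.put(
    "/Volumes/main/default/init_scripts/install_model_deps.sh",
    init_script,
    overwrite=True,
)
# Then attach the script to the cluster under Advanced options > Init scripts.
```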
## Alternative Approach
If you absolutely need environment isolation, consider:
**Pre-installing dependencies:** Before loading the model, manually install all required packages that match your model's dependencies using `%pip install`.
**Use Model Serving:** Instead of using `spark_udf`, deploy your model to a Model Serving endpoint, which handles environment management differently.
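Once an endpoint is up, you can query it with the MLflow deployments client instead of a Spark UDF; a rough sketch, where the endpoint name and the input record are placeholders:
```python
from mlflow.deployments import get_deploy_client

# Query a Databricks Model Serving endpoint; names and inputs are placeholders.
client = get_deploy_client("databricks")
response = client.predict(
    endpoint="my-model-endpoint",
    inputs={"dataframe_records": [{"feature_a": 1.0, "feature_b": 2.0}]},
)
print(response)
```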
## Long-term Recommendation
Monitor Databricks release notes for updates to environment management in future DBR versions. The current state suggests that `env_manager="local"` is the most reliable option until Databricks provides better support for isolated environments without a dependency on conda.
This issue has been reported by multiple users and appears to be a gap in the current DBR 14.3 implementation. Using `env_manager="local"` is the recommended workaround that will allow you to proceed with your inference workload.
Hope this helps, Louis.