SteveOstrowski
Databricks Employee

Hi,

This is a well-documented issue that comes down to cluster access mode and how mlflow.spark.load_model handles temporary file storage. Let me break down both problems you are hitting and provide solutions.


PROBLEM 1: OSError: [Errno 5] Input/output error: '/dbfs/tmp'

The root cause is that mlflow.spark.load_model uses the dfs_tmpdir parameter (which defaults to /tmp/mlflow) to temporarily stage model artifacts via the DBFS FUSE mount at /dbfs/. On Shared (Standard) access mode clusters, DBFS FUSE is not supported. From the Databricks documentation on access mode limitations:

"DBFS root and mounts do not support FUSE" and "POSIX-style paths (/) for DBFS are not supported."

This means mlflow.spark.load_model will always fail on Shared/Standard clusters because it cannot write to /dbfs/tmp.

Docs: https://docs.databricks.com/en/compute/access-mode-limitations.html
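
Before choosing a fix, it can help to confirm that the FUSE mount is really what is failing. The following is a minimal sketch, not an MLflow or Databricks API: the helper name can_write_to is hypothetical, and all it does is try to create a temporary file under a directory. I would expect it to return False for /dbfs/tmp on a Shared/Standard cluster, but treat that as an assumption to verify on your own cluster.

```python
import tempfile


def can_write_to(directory: str) -> bool:
    """Return True if a temporary file can be created under `directory`.

    Hypothetical diagnostic helper; not part of MLflow or Databricks.
    """
    try:
        # The temporary file is deleted again when the with-block exits.
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"probe")
        return True
    except OSError:
        # Covers Errno 5 (I/O error), permission errors, and missing paths.
        return False
```

Calling can_write_to("/dbfs/tmp") in a notebook cell gives a quick yes/no answer before you decide between the options below.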


SOLUTION OPTIONS FOR PROBLEM 1

Option A -- Use a Dedicated (Single User) access mode cluster

This is the simplest fix. Dedicated access mode clusters support DBFS FUSE mounts, so mlflow.spark.load_model works out of the box:

import mlflow

model = mlflow.spark.load_model("models:/your_model_name@production")
predictions = model.transform(test_df)

Machine learning workloads on Databricks generally require Dedicated access mode.

Docs: https://docs.databricks.com/en/machine-learning/manage-model-lifecycle/index.html


Option B -- Use a Unity Catalog Volume as a staging path (your workaround, refined)

Your workaround of downloading artifacts to a UC Volume is actually a solid approach. Here is a cleaner version:

from pyspark.ml import PipelineModel
import mlflow

mlflow.set_registry_uri("databricks-uc")

catalogue = "your_catalog"
schema = "your_schema"
volume = "your_volume"

local_model_path = "/local_disk0/mlflow_model"
volume_path = f"/Volumes/{catalogue}/{schema}/{volume}/sparkml_model"

# Step 1: Download artifacts to the driver's local disk
mlflow.artifacts.download_artifacts(
    artifact_uri="models:/your_model_name@production",
    dst_path=local_model_path
)

# Step 2: Copy to UC Volume (accessible by all workers)
dbutils.fs.cp(
    f"file://{local_model_path}/sparkml",
    volume_path,
    recurse=True
)

# Step 3: Load using PipelineModel.load directly
model = PipelineModel.load(volume_path)

# Step 4: Transform -- since PipelineModel wraps your FMRegressionModel,
# this works directly on Spark DataFrames with VectorUDT columns
predictions = model.transform(test_df)

This avoids the DBFS FUSE requirement entirely. UC Volumes are accessible from all cluster access modes.


Option C -- Set dfs_tmpdir to a cloud storage path

If you are on a Dedicated cluster but still hitting the error, you can explicitly set dfs_tmpdir to a cloud storage path instead of relying on the default, for example a dbfs:/ URI:

model = mlflow.spark.load_model(
    "models:/your_model_name@production",
    dfs_tmpdir="dbfs:/tmp/mlflow_staging"
)

Or to a UC Volume path:

model = mlflow.spark.load_model(
    "models:/your_model_name@production",
    dfs_tmpdir="/Volumes/your_catalog/your_schema/your_volume/mlflow_tmp"
)
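
If you are not sure which staging path your cluster accepts, one possible pattern is to try several in order. This is only a sketch under stated assumptions: load_with_staging_fallback is a hypothetical helper, and it assumes an unusable staging path surfaces as an OSError, as in the Errno 5 traceback above. The loader argument defaults to mlflow.spark.load_model and is only injectable so the logic can be checked without a cluster.

```python
from typing import Any, Callable, Optional, Sequence


def load_with_staging_fallback(
    model_uri: str,
    staging_dirs: Sequence[str],
    loader: Optional[Callable[..., Any]] = None,
) -> Any:
    """Try each candidate dfs_tmpdir in order; return the first model that loads.

    Hypothetical helper; `loader` defaults to mlflow.spark.load_model.
    """
    if loader is None:
        # Imported lazily so the helper can be exercised without MLflow installed.
        import mlflow.spark

        loader = mlflow.spark.load_model

    failures = []
    for staging_dir in staging_dirs:
        try:
            return loader(model_uri, dfs_tmpdir=staging_dir)
        except OSError as exc:
            # Assumption: a bad staging path raises OSError (e.g. Errno 5).
            failures.append(f"{staging_dir}: {exc}")
    raise OSError("No staging path worked: " + "; ".join(failures))
```

For example, you could pass ["dbfs:/tmp/mlflow_staging", "/Volumes/your_catalog/your_schema/your_volume/mlflow_tmp"] as staging_dirs, using the same placeholder names as above.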


PROBLEM 2: mlflow.pyfunc.spark_udf FAILS WITH VectorUDT

The error you see:

IllegalArgumentException: requirement failed: Column features must be of type
class org.apache.spark.ml.linalg.VectorUDT but was actually class
org.apache.spark.sql.types.StructType

This happens because mlflow.pyfunc.spark_udf routes data through pandas during UDF execution. Spark ML's VectorUDT is a special type that pandas cannot represent natively, so it gets decomposed into a plain struct with type, size, indices, and values fields. When that data reaches the model, the model expects a VectorUDT column but receives a StructType instead.
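
To make the struct form concrete, here is a small pure-Python illustration that runs without Spark. It assumes the VectorUDT storage layout, where a type tag of 0 means sparse (size, indices, and values are set) and 1 means dense (only values is set). The function name struct_to_dense and the sample values are hypothetical. It is meant to show what the model receives in place of a vector, not to serve as a fix.

```python
from typing import Any, List, Mapping


def struct_to_dense(vec: Mapping[str, Any]) -> List[float]:
    """Rebuild a dense list of floats from the struct form of a Spark ML vector.

    Assumes the VectorUDT layout: type 0 is sparse, type 1 is dense.
    """
    if vec["type"] == 1:
        return [float(v) for v in vec["values"]]
    if vec["type"] == 0:
        dense = [0.0] * vec["size"]
        for index, value in zip(vec["indices"], vec["values"]):
            dense[index] = float(value)
        return dense
    raise ValueError(f"Unknown vector type tag: {vec['type']!r}")


# The same logical vector in both encodings (illustrative values only).
sparse_struct = {"type": 0, "size": 4, "indices": [1, 3], "values": [2.0, 5.0]}
dense_struct = {"type": 1, "size": None, "indices": None, "values": [0.0, 2.0, 0.0, 5.0]}

print(struct_to_dense(sparse_struct))
print(struct_to_dense(dense_struct))
```

Both structs describe the same four-element vector, but neither is a VectorUDT value, which is why the model's schema check rejects the column.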

This is a fundamental limitation of the pyfunc/UDF approach for Spark ML models that use VectorUDT features. The workaround is to avoid the pyfunc path entirely and use the native Spark ML API (i.e., PipelineModel.transform()) as shown in Option A or B above.


RECOMMENDED APPROACH

The cleanest solution is to use a Dedicated access mode cluster and call mlflow.spark.load_model directly. If you must use a Shared cluster, use the UC Volume workaround (Option B) to download, copy, and load via PipelineModel.load.

mlflow.spark.load_model returns a PipelineModel anyway (MLflow wraps individual models like FMRegressionModel in a PipelineModel during logging), so calling PipelineModel.load directly in Option B gives you the same result. You can then call .transform() on your VectorAssembler-transformed DataFrame. This works because the data stays in Spark's native format and never goes through pandas serialization.


DOCUMENTATION REFERENCES

- MLflow Spark API docs - load_model: https://mlflow.org/docs/latest/python_api/mlflow.spark.html
- Databricks access mode limitations: https://docs.databricks.com/en/compute/access-mode-limitations.html
- Manage model lifecycle in Unity Catalog: https://docs.databricks.com/en/machine-learning/manage-model-lifecycle/index.html
- MLflow Spark ML flavor guide: https://mlflow.org/docs/latest/ml/traditional-ml/sparkml/

Hope this helps! Let me know if you have questions about any of the approaches.

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.