02-26-2026 09:37 AM
We trained a Spark ML FMRegressor model and registered it to Unity Catalog via MLflow. When attempting to load it back using mlflow.spark.load_model, we get an
OSError: [Errno 5] Input/output error: '/dbfs/tmp' regardless of what dfs_tmpdir path is passed.
We also tried mlflow.pyfunc.spark_udf as an alternative, but it fails too: when the features VectorUDT column is serialized through pandas during UDF execution, it loses its type and becomes a plain StructType, causing an IllegalArgumentException at inference time.
Does anyone have a fix for this?
02-26-2026 09:40 AM
from pyspark.ml import PipelineModel
mlflow.set_registry_uri("databricks-uc")
local_model_path = "/local_disk0/mlflow_model"
volume_path = f"/Volumes/{catalogue}/default/mlflow_tmp/sparkml"
# Works fine - downloads to driver
mlflow.artifacts.download_artifacts(
artifact_uri=f"models:/{model_name}@production",
dst_path=local_model_path
)
# Copy from driver local disk to UC Volume (shared across all nodes)
dbutils.fs.cp(
f"file://{local_model_path}/sparkml",
f"dbfs:{volume_path}",
recurse=True
)
# Load from UC Volume - all workers can reach this
model = PipelineModel.load(volume_path)
I tried this workaround just now, but is there a proper way to load the Regressor-type model?
02-26-2026 09:47 AM
I also attempted this:
mlflow.set_registry_uri("databricks-uc")
loaded_model = mlflow.pyfunc.spark_udf(
spark,
model_uri=f"models:/{model_name}@production",
result_type="double"
)
It loaded fine in that command cell, and I tried to run predictions like:
from pyspark.sql.functions import col, struct
assemble_transform = assembler.transform(allNewRels)
preds_final_df = (
assemble_transform.withColumn(
"prediction",
loaded_model(struct(col("features")))
).select("id", "second_id", "prediction")
)
But trying to save the above DataFrame to a Delta table caused a Python worker to error out:
pyspark.errors.exceptions.captured.IllegalArgumentException: requirement failed: Column features must be of type class org.apache.spark.ml.linalg.VectorUDT:struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually class org.apache.spark.sql.types.StructType:struct<indices:array<int>,size:bigint,type:bigint,values:array<double>>.
I think we can't load the model this way because it is a Regressor whose features column uses VectorUDT.
Just posting the things I tried and where they fell short.
a month ago
Hi,
This is a well-documented issue that comes down to cluster access mode and how mlflow.spark.load_model handles temporary file storage. Let me break down both problems you are hitting and provide solutions.
PROBLEM 1: OSError: [Errno 5] Input/output error: '/dbfs/tmp'
The root cause is that mlflow.spark.load_model uses the dfs_tmpdir parameter (which defaults to /tmp/mlflow) to temporarily stage model artifacts via the DBFS FUSE mount at /dbfs/. On Shared (Standard) access mode clusters, DBFS FUSE is not supported. From the Databricks documentation on access mode limitations:
"DBFS root and mounts do not support FUSE" and "POSIX-style paths (/) for DBFS are not supported."
This means mlflow.spark.load_model will always fail on Shared/Standard clusters because it cannot write to /dbfs/tmp.
Docs: https://docs.databricks.com/en/compute/access-mode-limitations.html
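If you want to confirm which situation you are in, you can probe for the FUSE mount from the driver before calling load_model. This is a minimal sketch using only the standard library; the helper name is mine, not an MLflow or Databricks API:

```python
import os

def dbfs_fuse_available(probe="/dbfs/tmp"):
    """Return True if the DBFS FUSE mount is visible as a local directory.

    On Dedicated (Single User) clusters, /dbfs is a FUSE mount over DBFS.
    On Shared (Standard) clusters the path simply does not exist, which is
    why mlflow.spark.load_model fails there with OSError.
    """
    return os.path.isdir(probe)

# On a Shared cluster this prints False, signalling that
# mlflow.spark.load_model will not work and you need Option B or C.
print(dbfs_fuse_available())
```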
SOLUTION OPTIONS FOR PROBLEM 1
Option A -- Use a Dedicated (Single User) access mode cluster
This is the simplest fix. Dedicated access mode clusters support DBFS FUSE mounts, so mlflow.spark.load_model works out of the box:
import mlflow
model = mlflow.spark.load_model("models:/your_model_name@production")
predictions = model.transform(test_df)
Machine learning workloads on Databricks generally require Dedicated access mode.
Docs: https://docs.databricks.com/en/machine-learning/manage-model-lifecycle/index.html
Option B -- Use a Unity Catalog Volume as a staging path (your workaround, refined)
Your workaround of downloading artifacts to a UC Volume is actually a solid approach. Here is a cleaner version:
from pyspark.ml import PipelineModel
import mlflow
mlflow.set_registry_uri("databricks-uc")
catalogue = "your_catalog"
schema = "your_schema"
volume = "your_volume"
local_model_path = "/local_disk0/mlflow_model"
volume_path = f"/Volumes/{catalogue}/{schema}/{volume}/sparkml_model"
# Step 1: Download artifacts to the driver's local disk
mlflow.artifacts.download_artifacts(
artifact_uri="models:/your_model_name@production",
dst_path=local_model_path
)
# Step 2: Copy to UC Volume (accessible by all workers)
dbutils.fs.cp(
f"file://{local_model_path}/sparkml",
volume_path,
recurse=True
)
# Step 3: Load using PipelineModel.load directly
model = PipelineModel.load(volume_path)
# Step 4: Transform -- since PipelineModel wraps your FMRegressionModel,
# this works directly on Spark DataFrames with VectorUDT columns
predictions = model.transform(test_df)
This avoids the DBFS FUSE requirement entirely, since UC Volumes are accessible from all cluster access modes.
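If you use this staging pattern in several jobs, the three steps can be wrapped in one small helper. This is only a sketch under my own naming (load_sparkml_from_uc is hypothetical); dbutils is the Databricks notebook utility object, passed in explicitly, and the imports are deferred so the function can be defined outside a Databricks runtime:

```python
def load_sparkml_from_uc(model_uri, volume_dir, dbutils,
                         local_dir="/local_disk0/mlflow_model"):
    """Download a UC-registered Spark ML model and load it via a UC Volume.

    Sketch only: assumes a Databricks notebook (for `dbutils`) and a UC
    Volume the cluster can write to. Imports are deferred so the definition
    itself has no Spark/MLflow dependency.
    """
    import mlflow
    from pyspark.ml import PipelineModel

    mlflow.set_registry_uri("databricks-uc")
    # 1) Pull the artifacts onto the driver's local disk (no DBFS FUSE needed).
    mlflow.artifacts.download_artifacts(artifact_uri=model_uri,
                                        dst_path=local_dir)
    # 2) Stage the sparkml subfolder on the UC Volume so executors can read it.
    dbutils.fs.cp(f"file://{local_dir}/sparkml", volume_dir, recurse=True)
    # 3) Load with the native Spark ML reader -- no pandas round trip, so
    #    VectorUDT feature columns are preserved.
    return PipelineModel.load(volume_dir)
```

Usage would then be a single call, e.g. model = load_sparkml_from_uc(f"models:/{model_name}@production", volume_path, dbutils), followed by model.transform(...).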
Option C -- Set dfs_tmpdir to a cloud storage path
If you are on a Dedicated cluster but still hitting the error, you can explicitly set dfs_tmpdir to a cloud storage path:
model = mlflow.spark.load_model(
"models:/your_model_name@production",
dfs_tmpdir="dbfs:/tmp/mlflow_staging"
)
Or to a UC Volume path:
model = mlflow.spark.load_model(
"models:/your_model_name@production",
dfs_tmpdir="/Volumes/your_catalog/your_schema/your_volume/mlflow_tmp"
)
PROBLEM 2: mlflow.pyfunc.spark_udf FAILS WITH VectorUDT
The error you see:
IllegalArgumentException: requirement failed: Column features must be of type
class org.apache.spark.ml.linalg.VectorUDT but was actually class
org.apache.spark.sql.types.StructType
This happens because mlflow.pyfunc.spark_udf routes data through pandas during UDF execution. Spark ML's VectorUDT is a user-defined type that pandas cannot represent natively -- it gets decomposed into a plain struct with type, size, indices, and values fields. When the data comes back into Spark, the model expects a VectorUDT column but receives a StructType instead.
This is a fundamental limitation of the pyfunc/UDF approach for Spark ML models that use VectorUDT features. The workaround is to avoid the pyfunc path entirely and use the native Spark ML API (i.e., PipelineModel.transform()) as shown in Option A or B above.
RECOMMENDED APPROACH
The cleanest solution is to use a Dedicated access mode cluster and call mlflow.spark.load_model directly. If you must use a Shared cluster, use the UC Volume workaround (Option B) to download, copy, and load via PipelineModel.load.
Since mlflow.spark.load_model returns a PipelineModel anyway (MLflow wraps individual models like FMRegressionModel in a PipelineModel during logging), using PipelineModel.load directly in Option B gives you the same result. You can then call .transform() on it with your VectorAssembler-transformed DataFrame and it will work correctly since the data stays in Spark's native format without pandas serialization.
DOCUMENTATION REFERENCES
- MLflow Spark API docs - load_model: https://mlflow.org/docs/latest/python_api/mlflow.spark.html
- Databricks access mode limitations: https://docs.databricks.com/en/compute/access-mode-limitations.html
- Manage model lifecycle in Unity Catalog: https://docs.databricks.com/en/machine-learning/manage-model-lifecycle/index.html
- MLflow Spark ML flavor guide: https://mlflow.org/docs/latest/ml/traditional-ml/sparkml/
Hope this helps! Let me know if you have questions about any of the approaches.
* This reply was drafted with an agent system I built, which researches responses using a wide set of documentation and prior memory. I personally review each draft for obvious issues, monitor the system's reliability, and update the reply when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.
a month ago
Thank you so much for this response! So I did find a fix (with what I posted and your Option B!)
Option A: Didn't work. Surprisingly, I was already using an ML Single User cluster for everything in my job, and it still triggered this error.
Option B: This is exactly the route I ended up taking (I had posted a similar version for people to see/comment on), and I am glad you mention this because it definitely worked!
Thanks for confirming. For anyone else reading: Option B is the way!