Logging a Spark pipeline model using mlflow.spark leads to PythonSecurityException
02-17-2023 08:02 AM
Hello,
I am currently using a simple PySpark pipeline to transform my training data, fit a model, and log the model using mlflow.spark. But I get the following error (with mlflow.sklearn it works perfectly fine, but due to the size of my data I need to use the PySpark ML library):
org.apache.spark.api.python.PythonSecurityException: Path 'mlflowdbfs:/artifacts?run_id=d2ecf91f0&path=/best_model/sparkml/metadata' uses an untrusted filesystem 'com.databricks.mlflowdbfs.MlflowdbfsFileSystem', but your administrator has configured Spark to only allow trusted filesystems: (com.databricks.s3a.S3AFileSystem, shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystemHadoop3, shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem, shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem, shaded.databricks.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem, com.databricks.adl.AdlFileSystem, shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystemHadoop3, shaded.databricks.V2_1_4.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem, shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem, shaded.databricks.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemHadoop3, shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem)
Here is the code that I use:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
import mlflow
import mlflow.spark  # "from mlflow import spark" would shadow the SparkSession "spark"

# Set the experiment before starting the run, then start an MLflow run
mlflow.set_experiment("/Users/my-id/experiments")
with mlflow.start_run():
    # Read in data from a CSV file
    data = spark.read.csv("dbfs:/FileStore/tables/data.csv", header=True, inferSchema=True)

    # Preprocess data
    labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
    assembler = VectorAssembler(inputCols=data.columns[:-1], outputCol="features")
    pipeline = Pipeline(stages=[labelIndexer, assembler])
    preprocessedData = pipeline.fit(data).transform(data)

    # Split data into training and test sets
    (trainingData, testData) = preprocessedData.randomSplit([0.7, 0.3])

    # Define model and hyperparameters to tune
    rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features")
    paramGrid = ParamGridBuilder() \
        .addGrid(rf.numTrees, [10, 20, 30]) \
        .addGrid(rf.maxDepth, [5, 10, 15]) \
        .build()

    # Evaluate model using area under ROC
    evaluator = BinaryClassificationEvaluator(labelCol="indexedLabel", metricName="areaUnderROC")

    # Perform cross-validation to tune hyperparameters
    cv = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
    cvModel = cv.fit(trainingData)

    # Log the best model and its metrics
    mlflow.spark.log_model(spark_model=cvModel.bestModel, artifact_path="best_model")
Does anyone know how to solve this issue?
Thanks in advance!
Labels:
- Azure Databricks
- Log Model
- MLflow
04-09-2023 08:16 AM
@Saeid Hedayati:
The error message indicates that mlflow.spark.log_model is attempting to write the model metadata through the com.databricks.mlflowdbfs.MlflowdbfsFileSystem filesystem, which is not on the allow-list of trusted filesystems that your administrator has configured for Spark.
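Before changing any code, it can help to confirm what the cluster currently trusts. A minimal sketch, assuming the allow-list is exposed through the spark.databricks.pyspark.trustedFilesystems Spark conf (that key is inferred from the error text and may differ on your runtime):
# Inspect the trusted-filesystem allow-list that the error refers to.
# NOTE: the conf key below is an assumption; check your cluster's Spark config.
trusted = spark.conf.get("spark.databricks.pyspark.trustedFilesystems", "<not set>")
print(trusted)
# If your administrator agrees, com.databricks.mlflowdbfs.MlflowdbfsFileSystem
# could be appended to this list in the cluster's Spark configuration.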
One potential solution is to point the experiment's artifact store at a trusted filesystem such as S3 or Azure Blob Storage, so that MLflow never goes through mlflowdbfs. The artifact store is set per experiment when it is created, via the artifact_location argument of mlflow.create_experiment (MLflow has no environment variable for this, and the location cannot be changed after the experiment exists, so you may need a new experiment).
For example, if you are using S3 as your artifact store:
import mlflow

# Create a new experiment whose artifacts live on S3.
experiment_id = mlflow.create_experiment(
    "/Users/my-id/experiments-s3",
    artifact_location="s3://my-bucket/mlflow",
)
Replace my-bucket with the name of your S3 bucket and mlflow with the desired path in the bucket.
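Once the experiment's artifacts live on a trusted filesystem, you can log against it as before. A minimal sketch using only standard MLflow calls (the experiment name matches the placeholder created above):
# Activate the experiment backed by the trusted artifact store, then log the model.
mlflow.set_experiment("/Users/my-id/experiments-s3")
with mlflow.start_run():
    mlflow.spark.log_model(spark_model=cvModel.bestModel, artifact_path="best_model")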
Alternatively, you can write the model to a local path directly with mlflow.spark.save_model, which bypasses the run's artifact store (note that mlflow.start_run has no artifact_uri parameter; the artifact location is a property of the experiment, not the run):
mlflow.spark.save_model(cvModel.bestModel, path="/path/to/local/dir/best_model")
Replace /path/to/local/dir with the path to a local directory where you want to save the model.
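If you still want the model attached to the MLflow run, another possible workaround (a sketch, not a confirmed fix) is to persist the model with Spark ML's native writer and then upload the resulting directory as ordinary run artifacts, which avoids the mlflowdbfs filesystem; the DBFS paths below are placeholders:
# Save the best model with Spark's native persistence.
cvModel.bestModel.write().overwrite().save("dbfs:/FileStore/models/best_model")

# Attach the saved files to the active run as plain artifacts.
# mlflow.log_artifacts expects a local directory, hence the /dbfs fuse mount.
mlflow.log_artifacts("/dbfs/FileStore/models/best_model", artifact_path="best_model")
Note that a model logged this way has no MLmodel metadata, so it would be reloaded with the corresponding Spark ML .load method rather than mlflow.spark.load_model.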
04-21-2023 02:01 AM
Hi @Saeid Hedayati
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?
This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!