Logging Spark pipeline model using mlflow.spark leads to PythonSecurityException

Saeid_H
Contributor

Hello,

I am currently using a simple PySpark pipeline to transform my training data, fit a model, and log the model using mlflow.spark, but I get the following error. (With mlflow.sklearn it works perfectly fine, but due to the size of my data I need to use the PySpark ML library.)

org.apache.spark.api.python.PythonSecurityException: Path 'mlflowdbfs:/artifacts?run_id=d2ecf91f0&path=/best_model/sparkml/metadata' uses an untrusted filesystem 'com.databricks.mlflowdbfs.MlflowdbfsFileSystem', but your administrator has configured Spark to only allow trusted filesystems: (com.databricks.s3a.S3AFileSystem, shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystemHadoop3, shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem, shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem, shaded.databricks.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem, com.databricks.adl.AdlFileSystem, shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystemHadoop3, shaded.databricks.V2_1_4.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem, shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem, shaded.databricks.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemHadoop3, shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem)

Here is the code that I use:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
import mlflow
import mlflow.spark  # "from mlflow import spark" would shadow the SparkSession named spark
 
# Select the experiment first, then start an MLflow run
mlflow.set_experiment("/Users/my-id/experiments")
with mlflow.start_run():
 
    # Read in data from a CSV file
    data = spark.read.csv("dbfs:/FileStore/tables/data.csv", header=True, inferSchema=True)
 
    # Preprocess data
    labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
    assembler = VectorAssembler(inputCols=data.columns[:-1], outputCol="features")
    pipeline = Pipeline(stages=[labelIndexer, assembler])
    preprocessedData = pipeline.fit(data).transform(data)
 
    # Split data into training and test sets
    (trainingData, testData) = preprocessedData.randomSplit([0.7, 0.3])
 
    # Define model and hyperparameters to tune
    rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features")
    paramGrid = ParamGridBuilder() \
        .addGrid(rf.numTrees, [10, 20, 30]) \
        .addGrid(rf.maxDepth, [5, 10, 15]) \
        .build()
 
    # Evaluate model using area under ROC
    evaluator = BinaryClassificationEvaluator(labelCol="indexedLabel", metricName="areaUnderROC")
 
    # Perform cross-validation to tune hyperparameters
    cv = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
    cvModel = cv.fit(trainingData)
 
    # Log model and its metrics
    mlflow.spark.log_model(spark_model=cvModel.bestModel, artifact_path="best_model")

Does anyone know how to solve this issue?

Thanks in advance!

2 REPLIES

Anonymous
Not applicable

@Saeid Hedayati:

The error message indicates that mlflow.spark.log_model is attempting to save the model metadata through the filesystem com.databricks.mlflowdbfs.MlflowdbfsFileSystem, which is not in the list of filesystems your administrator has configured Spark to trust.
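As a quick check (a minimal sketch using the standard MLflow API, with the experiment path from your snippet), you can print the artifact root MLflow resolves for a run. On clusters where mlflowdbfs is enabled, it comes back as an mlflowdbfs:/ URI, which is exactly the filesystem the security check rejects:

import mlflow

mlflow.set_experiment("/Users/my-id/experiments")
with mlflow.start_run():
    # Prints the artifact root for this run; an mlflowdbfs:/ URI here means
    # artifact writes will go through the filesystem the error complains about.
    print(mlflow.get_artifact_uri())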

One potential solution is to point MLflow's artifact store at a trusted filesystem such as S3 or Azure Blob Storage. The artifact location is fixed when an experiment is created, so you can create the experiment with an artifact_location on a trusted store and then select it before starting the run.

For example, if you are using S3 as your artifact store:

import mlflow
mlflow.create_experiment("/Users/my-id/experiments-s3", artifact_location="s3://my-bucket/mlflow")
mlflow.set_experiment("/Users/my-id/experiments-s3")

Replace my-bucket with the name of your S3 bucket and mlflow with the desired path in the bucket.
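If you want to keep using log_model, mlflow.spark also accepts a dfs_tmpdir argument that controls the intermediate DFS directory the Spark model is written to before being copied into the artifact store. Whether this sidesteps the mlflowdbfs check depends on your workspace configuration, so treat this as a sketch to try rather than a guaranteed fix (the path below is just an example):

mlflow.spark.log_model(
    spark_model=cvModel.bestModel,
    artifact_path="best_model",
    dfs_tmpdir="dbfs:/tmp/mlflow_spark_tmp",  # example scratch location; pick any writable DBFS path
)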

Alternatively, you can bypass the tracking server's artifact store entirely and write the model to a filesystem path with mlflow.spark.save_model instead of logging it:

mlflow.spark.save_model(spark_model=cvModel.bestModel, path="/dbfs/tmp/best_model")

Replace /dbfs/tmp/best_model with the path to a directory where you want to save the model.
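To confirm the saved model is usable, you can load it back and score with it (a minimal sketch assuming the example path above and the testData DataFrame from your code):

loaded_model = mlflow.spark.load_model("/dbfs/tmp/best_model")

# Apply the loaded model to held-out data and inspect a few predictions
predictions = loaded_model.transform(testData)
predictions.select("prediction").show(5)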

Anonymous
Not applicable

Hi @Saeid Hedayati,

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
