Hello,
I am currently using a simple PySpark pipeline to transform my training data, fit a model, and log the model with mlflow.spark, but I get the following error (with mlflow.sklearn it works perfectly fine, but due to the size of my data I need to use the PySpark ML library):
org.apache.spark.api.python.PythonSecurityException: Path 'mlflowdbfs:/artifacts?run_id=d2ecf91f0&path=/best_model/sparkml/metadata' uses an untrusted filesystem 'com.databricks.mlflowdbfs.MlflowdbfsFileSystem', but your administrator has configured Spark to only allow trusted filesystems: (com.databricks.s3a.S3AFileSystem, shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystemHadoop3, shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem, shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem, shaded.databricks.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem, com.databricks.adl.AdlFileSystem, shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystemHadoop3, shaded.databricks.V2_1_4.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem, shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem, shaded.databricks.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemHadoop3, shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem)
Here is the code that I use; the failure happens on the mlflow.spark.log_model call at the end:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
import mlflow
import mlflow.spark

# Set the experiment before starting the run, otherwise the run is
# created in the default experiment
mlflow.set_experiment("/Users/my-id/experiments")

with mlflow.start_run():
    # Read in data from a CSV file ("spark" is the SparkSession that
    # Databricks provides in the notebook)
    data = spark.read.csv("dbfs:/FileStore/tables/data.csv", header=True, inferSchema=True)

    # Preprocess data: index the string label and assemble the feature
    # columns (everything except the last column) into a single vector
    labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
    assembler = VectorAssembler(inputCols=data.columns[:-1], outputCol="features")
    pipeline = Pipeline(stages=[labelIndexer, assembler])
    preprocessedData = pipeline.fit(data).transform(data)

    # Split data into training and test sets
    (trainingData, testData) = preprocessedData.randomSplit([0.7, 0.3])

    # Define the model and the hyperparameter grid to tune
    rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features")
    paramGrid = ParamGridBuilder() \
        .addGrid(rf.numTrees, [10, 20, 30]) \
        .addGrid(rf.maxDepth, [5, 10, 15]) \
        .build()

    # Evaluate the model using area under ROC
    evaluator = BinaryClassificationEvaluator(labelCol="indexedLabel", metricName="areaUnderROC")

    # Perform cross-validation to tune the hyperparameters
    cv = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
    cvModel = cv.fit(trainingData)

    # Log the best model; this is the call that raises the exception above
    mlflow.spark.log_model(spark_model=cvModel.bestModel, artifact_path="best_model")
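For what it's worth, the only fallback I can think of (a minimal, untested sketch, assuming the admin won't add mlflowdbfs to the trusted filesystems) is to bypass mlflowdbfs entirely: save the model to DBFS with the native Spark ML writer and log that directory as plain artifacts. The dbfs:/tmp path is just a placeholder, and both calls would go inside the with mlflow.start_run(): block in place of log_model:

# Workaround sketch: write the model to DBFS ourselves instead of letting
# mlflow.spark.log_model stream it through mlflowdbfs:/
# ("dbfs:/tmp/best_model" is a placeholder path, not from my actual setup)
cvModel.bestModel.write().overwrite().save("dbfs:/tmp/best_model")

# mlflow.log_artifacts expects a local path; on Databricks the DBFS root
# is also mounted under /dbfs, so the saved directory is visible there
mlflow.log_artifacts("/dbfs/tmp/best_model", artifact_path="best_model")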
Does anyone know how to solve this issue?
Thanks in advance!