Logging Spark pipeline model using mlflow.spark leads to PythonSecurityException

Saeid_H
Contributor

Hello,

I am currently using a simple PySpark pipeline to transform my training data, fit a model, and log the model using mlflow.spark, but I get the following error. (With mlflow.sklearn it works perfectly fine, but due to the size of my data I need to use the PySpark ML library.)

org.apache.spark.api.python.PythonSecurityException: Path 'mlflowdbfs:/artifacts?run_id=d2ecf91f0&path=/best_model/sparkml/metadata' uses an untrusted filesystem 'com.databricks.mlflowdbfs.MlflowdbfsFileSystem', but your administrator has configured Spark to only allow trusted filesystems: (com.databricks.s3a.S3AFileSystem, shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystemHadoop3, shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem, shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem, shaded.databricks.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem, com.databricks.adl.AdlFileSystem, shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystemHadoop3, shaded.databricks.V2_1_4.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem, shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem, shaded.databricks.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemHadoop3, shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem)

Here is the code that I use:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
import mlflow
import mlflow.spark  # "from mlflow import spark" would shadow the SparkSession named spark
 
# Select the experiment first, then start an MLflow run
mlflow.set_experiment("/Users/my-id/experiments")
with mlflow.start_run():
 
    # Read in data from a CSV file
    data = spark.read.csv("dbfs:/FileStore/tables/data.csv", header=True, inferSchema=True)
 
    # Preprocess data
    labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
    assembler = VectorAssembler(inputCols=data.columns[:-1], outputCol="features")
    pipeline = Pipeline(stages=[labelIndexer, assembler])
    preprocessedData = pipeline.fit(data).transform(data)
 
    # Split data into training and test sets
    (trainingData, testData) = preprocessedData.randomSplit([0.7, 0.3])
 
    # Define model and hyperparameters to tune
    rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features")
    paramGrid = ParamGridBuilder() \
        .addGrid(rf.numTrees, [10, 20, 30]) \
        .addGrid(rf.maxDepth, [5, 10, 15]) \
        .build()
 
    # Evaluate model using area under ROC
    evaluator = BinaryClassificationEvaluator(labelCol="indexedLabel", metricName="areaUnderROC")
 
    # Perform cross-validation to tune hyperparameters
    cv = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
    cvModel = cv.fit(trainingData)
 
    # Log model and its metrics
    mlflow.spark.log_model(spark_model=cvModel.bestModel, artifact_path="best_model")

Does anyone know how to solve this issue?

Thanks in advance!

2 REPLIES

Anonymous
Not applicable

@Saeid Hedayati:

The error message indicates that mlflow.spark.log_model is attempting to save the model metadata through the filesystem com.databricks.mlflowdbfs.MlflowdbfsFileSystem, which is not in the list of filesystems your administrator has configured Spark to trust.
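As a quick check (a minimal sketch using the standard MLflow API, with the experiment path from your snippet), you can print the artifact root MLflow resolves for a run. On clusters where mlflowdbfs is enabled, it comes back as an mlflowdbfs:/ URI, which is exactly the filesystem the security check rejects:

import mlflow

mlflow.set_experiment("/Users/my-id/experiments")
with mlflow.start_run():
    # Prints the artifact root for this run; an mlflowdbfs:/ URI here means
    # artifact writes will go through the filesystem the error complains about.
    print(mlflow.get_artifact_uri())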

One potential solution is to point MLflow's artifact store at a trusted filesystem such as S3 or Azure Blob Storage. The artifact location is fixed when an experiment is created, so you can create the experiment with an artifact_location on a trusted store and then select it before starting the run.

For example, if you are using S3 as your artifact store:

import mlflow
mlflow.create_experiment("/Users/my-id/experiments-s3", artifact_location="s3://my-bucket/mlflow")
mlflow.set_experiment("/Users/my-id/experiments-s3")

Replace my-bucket with the name of your S3 bucket and mlflow with the desired path in the bucket.
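If you want to keep using log_model, mlflow.spark also accepts a dfs_tmpdir argument that controls the intermediate DFS directory the Spark model is written to before being copied into the artifact store. Whether this sidesteps the mlflowdbfs check depends on your workspace configuration, so treat this as a sketch to try rather than a guaranteed fix (the path below is just an example):

mlflow.spark.log_model(
    spark_model=cvModel.bestModel,
    artifact_path="best_model",
    dfs_tmpdir="dbfs:/tmp/mlflow_spark_tmp",  # example scratch location; pick any writable DBFS path
)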

Alternatively, you can bypass the tracking server's artifact store entirely and write the model to a filesystem path with mlflow.spark.save_model instead of logging it:

mlflow.spark.save_model(spark_model=cvModel.bestModel, path="/dbfs/tmp/best_model")

Replace /dbfs/tmp/best_model with the path to a directory where you want to save the model.
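To confirm the saved model is usable, you can load it back and score with it (a minimal sketch assuming the example path above and the testData DataFrame from your code):

loaded_model = mlflow.spark.load_model("/dbfs/tmp/best_model")

# Apply the loaded model to held-out data and inspect a few predictions
predictions = loaded_model.transform(testData)
predictions.select("prediction").show(5)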

Anonymous
Not applicable

Hi @Saeid Hedayati,

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
