Databricks Community

Miki · 03-15-2024

I am logging a trained Keras model using the following:

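import mlflow
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

# model, training_set and model_name are defined earlier in the notebook.
fe.log_model(
    model=model,
    artifact_path="wine_quality_prediction",
    flavor=mlflow.keras,
    training_set=training_set,
    registered_model_name=model_name,
)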

And when I call the following:

predictions_df = fe.score_batch(
    model_uri=f"models:/{model_name}/{latest_model_version}",
    df=batch_input_df,
)
display(predictions_df)

I get the following error:

OSError: [Errno 30] Read-only file system: '/local_disk0/.ephemeral_nfs/repl_tmp_data/ReplId-62b1a-a1cdf-baa5b-1/mlflow/models/tmpajr94lkz/raw_model/data/model.keras'

I get the same

I am essentially just trying to adapt this example to use a Keras model instead of a random forest. The code runs fine with a random forest, and it also runs fine if I use the native MLflow log_model(), load_model() and predict() functions. However, I do get the error if I load the model with mlflow.pyfunc.load_model() and then call model.predict(). This makes me think the bug is specific to the way the Databricks FeatureEngineeringClient saves the Keras model.

I would appreciate any help with this issue.

Miki · 03-18-2024

Hi Kaniz,

Thanks for the response. Apologies if I am missing something, but since I am calling the Databricks FeatureEngineeringClient.log_model() method directly, I have no option to specify the path the model is written to. The only parameters I can provide are the artifact path and the model name, and neither gives me enough control to implement the solutions you suggest. I could potentially define a custom pyfunc instead of using the existing mlflow.keras flavor, and then write my own save_model() and load_model() functions. However, I am struggling to see why this error happens only when I use FeatureEngineeringClient to log and load my model, while everything works fine when I use MLflow's own logging and loading (although that approach prevents me from using the automatic feature lookups the feature store provides).

Am I missing something?
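One way the custom-pyfunc idea above might sidestep the read-only path is to copy the logged artifact into a writable scratch directory before handing it to Keras. The helper below is a minimal, untested sketch of only that copy step, using the standard library. The function names are hypothetical, and it assumes (this is not confirmed in the thread) that the failure comes from the loader writing next to model.keras in place.

```python
import os
import shutil
import stat
import tempfile


def _add_owner_write(path: str) -> None:
    """Set the owner-write bit on a file or directory, keeping its other permission bits."""
    os.chmod(path, stat.S_IMODE(os.stat(path).st_mode) | stat.S_IWUSR)


def copy_to_writable_dir(src_path: str) -> str:
    """Copy a file or directory into a fresh temporary directory and return the copy's path.

    Hypothetical helper: assumes tempfile's default location is writable on the
    node that loads the model.
    """
    dest_root = tempfile.mkdtemp(prefix="model_copy_")
    # Keep the original base name (e.g. "model.keras" or "data") so loaders that
    # look at the file name or extension still recognise the format.
    dest_path = os.path.join(dest_root, os.path.basename(os.path.normpath(src_path)))

    if os.path.isdir(src_path):
        shutil.copytree(src_path, dest_path)
        # copytree preserves permission bits, so restore write access throughout the copy.
        for root, _dirs, files in os.walk(dest_path):
            _add_owner_write(root)
            for name in files:
                _add_owner_write(os.path.join(root, name))
    else:
        # copy2 also preserves permission bits; make the copied file writable afterwards.
        shutil.copy2(src_path, dest_path)
        _add_owner_write(dest_path)

    return dest_path
```

Inside a custom mlflow.pyfunc.PythonModel, load_context(self, context) could pass one of the local paths in context.artifacts to this helper and then load the Keras model from the returned copy. The artifact key would be whatever name was chosen when logging, which is an assumption here.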

Miki · 03-18-2024

I figured it out based on this issue that someone posted. After switching to a single-node cluster (so worker-node permissions no longer matter), the code now works.
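For anyone reproducing this fix outside the UI, the sketch below builds a single-node cluster definition as a plain Python dict in the shape of a Databricks Clusters API create payload. The function name, the cluster name, and the spark_version/node_type_id values are hypothetical placeholders. The singleNode profile, local Spark master, and ResourceClass tag follow the commonly documented single-node settings, but they should be checked against the current Databricks documentation for your workspace.

```python
import json


def single_node_cluster_spec(cluster_name: str, spark_version: str, node_type_id: str) -> dict:
    """Return a Clusters API-style create payload for a driver-only (single-node) cluster."""
    return {
        "cluster_name": cluster_name,
        "spark_version": spark_version,  # Databricks Runtime version string (placeholder)
        "node_type_id": node_type_id,  # instance type for the single node (placeholder)
        # Zero workers: Spark runs every task on the driver node itself.
        "num_workers": 0,
        "spark_conf": {
            "spark.databricks.cluster.profile": "singleNode",
            "spark.master": "local[*]",
        },
        "custom_tags": {"ResourceClass": "SingleNode"},
    }


if __name__ == "__main__":
    # Hypothetical placeholder values; substitute a real runtime version and node type.
    spec = single_node_cluster_spec("keras-scoring", "<runtime-version>", "<node-type-id>")
    print(json.dumps(spec, indent=2))
```

The serialized dict could be submitted to the Clusters API; in the workspace UI, the equivalent is choosing the single-node option when creating the cluster.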