Databricks Community

vas610 · ‎08-06-2021

I'm getting the following error when I try to load an H2O model with MLflow for prediction.

Error:

   Error
   Job with key $03017f00000132d4ffffffff$_990da74b0db027b33cc49d1d90934149 failed with an exception: java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set

Source code:

 # !pip install requests
 # !pip install tabulate
 # !pip install "colorama>=0.3.8"
 # !pip install future
 # !pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
 # !pip install mlflow
 # !wget https://github.com/mlflow/mlflow-example/blob/master/wine-quality.csv

 import h2o
 import random
 import mlflow
 import mlflow.h2o
 from h2o.estimators.random_forest import H2ORandomForestEstimator
 h2o.init()
 wine = h2o.import_file(path="winequality.csv")
 r = wine['quality'].runif()
 train = wine[r < 0.7]
 test  = wine[0.3 <= r]
 mlflow.set_tracking_uri('https://mlflow.xxxxxxx.cloud/')
 mlflow.set_experiment("H2ORandomForestEstimator")
 
 def train_random_forest(ntrees):
     with mlflow.start_run():
         rf = H2ORandomForestEstimator(ntrees=ntrees)
         train_cols = [n for n in wine.col_names if n != "quality"]
         rf.train(train_cols, "quality", training_frame=train, validation_frame=test)      
         mlflow.log_param("ntrees", ntrees)        
         mlflow.log_metric("rmse", rf.rmse())
         mlflow.log_metric("r2", rf.r2())
         mlflow.log_metric("mae", rf.mae())       
         mlflow.h2o.log_model(rf, "model")        
         h2o.save_model(rf)            
         predict = rf.predict(test)        
         print(predict.head())

 for ntrees in [10, 20, 50, 100]:
     train_random_forest(ntrees)

 import mlflow
 logged_model = 's3://mlflow-sagemaker/1/66f7c015fe8d4fb080940f3d31003f49/artifacts/model'

 # Load model as a PyFuncModel.
 loaded_model = mlflow.pyfunc.load_model(logged_model)

 # Predict on a Pandas DataFrame.
 import pandas as pd
 loaded_model.predict(pd.DataFrame(test))

Dan_Z · ‎08-06-2021

I ran this in Databricks and it worked with no issues. I suggest you make sure your wget path is correct, because the one you posted downloads HTML, not the raw csv. That may cause the problem.

%sh
wget https://raw.githubusercontent.com/mlflow/mlflow-example/master/wine-quality.csv

import h2o
import random
import mlflow
import mlflow.h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()
wine = h2o.import_file(path="./wine-quality.csv")
r = wine['quality'].runif()
train = wine[r < 0.7]
test = wine[0.3 <= r]

def train_random_forest(ntrees):
    with mlflow.start_run():
        rf = H2ORandomForestEstimator(ntrees=ntrees)
        train_cols = [n for n in wine.col_names if n != "quality"]
        rf.train(train_cols, "quality", training_frame=train, validation_frame=test)
        mlflow.log_param("ntrees", ntrees)
        mlflow.log_metric("rmse", rf.rmse())
        mlflow.log_metric("r2", rf.r2())
        mlflow.log_metric("mae", rf.mae())
        mlflow.h2o.log_model(rf, "model")
        h2o.save_model(rf)
        predict = rf.predict(test)
        print(predict.head())

for ntrees in [10, 20, 50, 100]:
    train_random_forest(ntrees)
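
As a quick way to test the point about the wget path, the sketch below checks whether downloaded text looks like an HTML page or like CSV data, and reads the CSV header row. The helper names `looks_like_html` and `csv_header` and the sample strings are hypothetical, made up for illustration; they are not part of H2O or MLflow.

```python
import csv
import io
from typing import List


def looks_like_html(text: str) -> bool:
    """Return True if the text starts like an HTML page rather than CSV data."""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


def csv_header(text: str) -> List[str]:
    """Return the first row of CSV text as a list of column names."""
    reader = csv.reader(io.StringIO(text))
    return next(reader, [])


# Hypothetical file contents, for illustration only.
html_page = "<!DOCTYPE html>\n<html><head><title>wine-quality.csv</title></head></html>"
raw_csv = '"fixed acidity","volatile acidity","quality"\n7,0.27,6\n'

print(looks_like_html(html_page))  # True: this is a web page, not data
print(looks_like_html(raw_csv))    # False: this is plain CSV text
print(csv_header(raw_csv))         # ['fixed acidity', 'volatile acidity', 'quality']
```

In practice the text would come from reading the first few lines of the downloaded file; if it looks like HTML, the download most likely used the GitHub page URL rather than the raw file URL.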

vas610 · ‎08-09-2021

@Dan Zafar I gave the wrong path in the original question, but I did train the model with the correct file.

!wget https://raw.githubusercontent.com/mlflow/mlflow-example/master/wine-quality.csv

There are no issues when predicting with the H2O model object, but prediction fails when using MLflow's pyfunc flavour.

vas610 · ‎08-09-2021

import mlflow
logged_model = 's3://mlflow-s3 sagemaker/1/58e5371188ed4t649d2d75686a9f155d/artifacts/model'

# Load model as a PyFuncModel.
loaded_model = mlflow.pyfunc.load_model(logged_model)

# Predict on a Pandas DataFrame.
import pandas as pd
loaded_model.predict(pd.DataFrame(test))

vas610 · ‎08-09-2021

Error

OSError: Job with key $03017f00000132d4ffffffff$_9993cede52525f90fe9729b1ddb24cf7 failed with an exception: java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
stacktrace:
java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
    at hex.Model.adaptTestForTrain(Model.java:1568)

vas610 · ‎08-09-2021

Error

stacktrace:
java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
    at hex.Model.adaptTestForTrain(Model.java:1568)
    at hex.Model.adaptTestForTrain(Model.java:1404)
    at hex.Model.adaptTestForTrain(Model.java:1400)
    at hex.Model.score(Model.java:1697)
    at water.api.ModelMetricsHandler$1.compute2(ModelMetricsHandler.java:422)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1637)
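
The exception message says that the frame being scored shares no column names with the training frame. One thing that may be worth checking, as an assumption that is not confirmed anywhere in this thread, is whether the pandas DataFrame passed to `predict` still carries the training column names. A minimal sketch of that comparison follows; `shared_columns` is a hypothetical helper and the column lists are made-up examples, not output from the code above.

```python
from typing import Iterable, List


def shared_columns(training_columns: Iterable[str],
                   frame_columns: Iterable[object]) -> List[str]:
    """Return the training column names that also appear in the frame to be scored."""
    frame_names = {str(name) for name in frame_columns}
    return [name for name in training_columns if name in frame_names]


# Hypothetical column names, for illustration only.
training_columns = ["fixed acidity", "volatile acidity", "alcohol"]
good_frame_columns = ["fixed acidity", "volatile acidity", "alcohol", "quality"]
bad_frame_columns = [0, 1, 2, 3]  # positional labels instead of column names

print(shared_columns(training_columns, good_frame_columns))
print(shared_columns(training_columns, bad_frame_columns))  # empty list: nothing in common
```

In the thread's code, the first argument would correspond to `train_cols` and the second to the `.columns` of the DataFrame handed to `loaded_model.predict`. An empty result would be consistent with the error shown above.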