Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Error loading h2o model in mlflow

vas610
New Contributor III

I'm getting the following error when I try to load an H2O model with MLflow for prediction.

Error:

   Error
   Job with key $03017f00000132d4ffffffff$_990da74b0db027b33cc49d1d90934149 failed with an exception: java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set 

Source code:

 # !pip install requests
 # !pip install tabulate
 # !pip install "colorama>=0.3.8"
 # !pip install future
 # !pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
 # !pip install mlflow
 # !wget https://github.com/mlflow/mlflow-example/blob/master/wine-quality.csv

 import h2o
 import random
 import mlflow
 import mlflow.h2o
 from h2o.estimators.random_forest import H2ORandomForestEstimator
 h2o.init()
 wine = h2o.import_file(path="winequality.csv")
 r = wine['quality'].runif()
 train = wine[r  < 0.7]
 test  = wine[0.3 <= r]
 mlflow.set_tracking_uri('https://mlflow.xxxxxxx.cloud/')
 mlflow.set_experiment("H2ORandomForestEstimator")
 
 def train_random_forest(ntrees):
     with mlflow.start_run():
         rf = H2ORandomForestEstimator(ntrees=ntrees)
         train_cols = [n for n in wine.col_names if n != "quality"]
         rf.train(train_cols, "quality", training_frame=train, validation_frame=test)      
         mlflow.log_param("ntrees", ntrees)        
         mlflow.log_metric("rmse", rf.rmse())
         mlflow.log_metric("r2", rf.r2())
         mlflow.log_metric("mae", rf.mae())       
         mlflow.h2o.log_model(rf, "model")        
         h2o.save_model(rf)            
         predict = rf.predict(test)        
         print(predict.head())

 for ntrees in [10, 20, 50, 100]:
     train_random_forest(ntrees)

 import mlflow
 logged_model = 's3://mlflow-sagemaker/1/66f7c015fe8d4fb080940f3d31003f49/artifacts/model'

 # Load model as a PyFuncModel.
 loaded_model = mlflow.pyfunc.load_model(logged_model)

 # Predict on a Pandas DataFrame.
 import pandas as pd
 loaded_model.predict(pd.DataFrame(test))

5 REPLIES

Dan_Z
Databricks Employee

I ran this in Databricks and it worked with no issues. I suggest you check that your wget path is correct, because the one you posted downloads an HTML page, not the raw CSV. That may be causing the problem.

%sh
wget https://raw.githubusercontent.com/mlflow/mlflow-example/master/wine-quality.csv

 import h2o
 import random
 import mlflow
 import mlflow.h2o
 from h2o.estimators.random_forest import H2ORandomForestEstimator

 h2o.init()
 wine = h2o.import_file(path="./wine-quality.csv")
 r = wine['quality'].runif()
 train = wine[r < 0.7]
 test = wine[0.3 <= r]

 def train_random_forest(ntrees):
     with mlflow.start_run():
         rf = H2ORandomForestEstimator(ntrees=ntrees)
         train_cols = [n for n in wine.col_names if n != "quality"]
         rf.train(train_cols, "quality", training_frame=train, validation_frame=test)
         mlflow.log_param("ntrees", ntrees)
         mlflow.log_metric("rmse", rf.rmse())
         mlflow.log_metric("r2", rf.r2())
         mlflow.log_metric("mae", rf.mae())
         mlflow.h2o.log_model(rf, "model")
         h2o.save_model(rf)
         predict = rf.predict(test)
         print(predict.head())

 for ntrees in [10, 20, 50, 100]:
     train_random_forest(ntrees)
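
The point about the download can be checked directly before importing the file. The sketch below is only a heuristic and is not part of MLflow or H2O: the hypothetical helper `looks_like_html` reads the first bytes of a file and reports whether they look like an HTML page, which is what the GitHub `blob` URL returns instead of the raw CSV. The file names and contents in the demonstration are made up for illustration.

```python
import tempfile
from pathlib import Path


def looks_like_html(path, probe_bytes=512):
    """Heuristic: True if the file starts with an HTML doctype or <html> tag.

    Hypothetical helper for illustration; not an MLflow or H2O API.
    """
    with open(path, "rb") as fh:
        head = fh.read(probe_bytes)
    # Skip a UTF-8 byte-order mark, if present, then leading whitespace.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    head = head.lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))


# Demonstration on two throwaway files (contents are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    html_file = Path(tmp) / "downloaded-page.csv"
    html_file.write_text("<!DOCTYPE html>\n<html><body>GitHub page</body></html>\n")
    csv_file = Path(tmp) / "downloaded-raw.csv"
    csv_file.write_text('"fixed acidity","pH","quality"\n7.0,3.0,6\n')
    print(looks_like_html(html_file))  # True
    print(looks_like_html(csv_file))   # False
```

In the setup from this thread, the helper would be called on `wine-quality.csv` right after the `wget` step and before `h2o.import_file`.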

vas610
New Contributor III

@Dan Zafar I gave the wrong path in the original question, but I did train the model with the correct file.

!wget https://raw.githubusercontent.com/mlflow/mlflow-example/master/wine-quality.csv

There are no issues when predicting with the H2O model object, but prediction fails when using MLflow's pyfunc flavour.
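
The exception says the frame being scored shares no column names with the training frame. A possible cause, not confirmed anywhere in this thread, is the `pd.DataFrame(test)` call: `test` is an H2OFrame, and wrapping it in the pandas constructor may not carry over the feature names, whereas `test.as_data_frame()` is H2O's own conversion to pandas. The sketch below uses a hypothetical helper, `missing_features` (not an MLflow or H2O function), to report which training columns are absent from the frame about to be scored. The column lists are a small subset chosen for illustration.

```python
def missing_features(expected_cols, frame_cols):
    """Return the expected feature names that frame_cols does not contain.

    Hypothetical helper for illustration; not an MLflow or H2O API.
    """
    present = {str(c) for c in frame_cols}
    return [c for c in expected_cols if c not in present]


# Illustrative column lists: the integer labels stand in for a frame
# whose original header was lost during conversion.
train_cols = ["fixed acidity", "pH", "alcohol"]
print(missing_features(train_cols, ["fixed acidity", "pH", "alcohol", "quality"]))  # []
print(missing_features(train_cols, [0, 1, 2, 3]))  # ['fixed acidity', 'pH', 'alcohol']
```

Assuming the `train_cols` list from training is still available, the check here would be `missing_features(train_cols, test.as_data_frame().columns)`. An empty result suggests trying `loaded_model.predict(test.as_data_frame())`.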

vas610
New Contributor III

 import mlflow
 logged_model = 's3://mlflow-s3 sagemaker/1/58e5371188ed4t649d2d75686a9f155d/artifacts/model'

 # Load model as a PyFuncModel.
 loaded_model = mlflow.pyfunc.load_model(logged_model)

 # Predict on a Pandas DataFrame.
 import pandas as pd
 loaded_model.predict(pd.DataFrame(test))

vas610
New Contributor III

Error

OSError: Job with key $03017f00000132d4ffffffff$_9993cede52525f90fe9729b1ddb24cf7 failed with an exception: java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
stacktrace:
java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
    at hex.Model.adaptTestForTrain(Model.java:1568)

vas610
New Contributor III

Error

stacktrace:
java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
    at hex.Model.adaptTestForTrain(Model.java:1568)
    at hex.Model.adaptTestForTrain(Model.java:1404)
    at hex.Model.adaptTestForTrain(Model.java:1400)
    at hex.Model.score(Model.java:1697)
    at water.api.ModelMetricsHandler$1.compute2(ModelMetricsHandler.java:422)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1637)
