I'm training an ML model (e.g., XGBoost) and I have a large grid of 5 hyperparameters; if each parameter has 5 candidates, that's 5^5 = 3,125 combinations.
Now I want to parallelize the grid search over all of the hyperparameter combinations so I can find the best-performing model.
How can I achieve this on Databricks, especially using MLflow? I've been told I can define a function that trains and evaluates the model (using MLflow), define an array with all of the hyperparameter combinations, sc.parallelize the array, and then map the function over it.
I have come up with the code to sc.parallelize the array, like
paras_combo_test = [(x, y) for x in [50, 100, 150] for y in [0.8, 0.9, 0.95]]
sc.parallelize(paras_combo_test, 3).glom().collect()
(For simplicity, I'm using just two parameters x and y, giving 9 combos in total, which I divided into 3 partitions.)
How can I map the function that does the model training and evaluation (probably using MLflow) over the partitions of parameter combos I have, so that there are 3 tasks in parallel, each training 3 models?
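To make the question concrete, here is a minimal sketch of what I imagine the mapped function could look like, using mapPartitions so each partition's combos are trained in one task. This is only an illustration under assumptions: the two parameters are taken to be n_estimators and subsample for an XGBoost classifier, the dataset is a placeholder (sklearn's breast-cancer data), and the function names (make_param_grid, train_partition, run_grid_search) are ones I made up, not an established API.

```python
# Sketch: grid-search parallelized over Spark partitions.
# Assumes a Databricks cluster where `sc` (SparkContext), xgboost,
# scikit-learn, and mlflow are available; dataset and parameter
# meanings are illustrative placeholders.

def make_param_grid():
    # 9 combos of (n_estimators, subsample), to be split into 3 partitions
    return [(x, y) for x in [50, 100, 150] for y in [0.8, 0.9, 0.95]]

def train_partition(combos):
    # Runs on a worker: trains and evaluates one model per combo
    # in this partition. Imports are done inside the function so the
    # workers resolve them locally.
    import mlflow
    import xgboost as xgb
    from sklearn.datasets import load_breast_cancer  # placeholder data
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    results = []
    for n_estimators, subsample in combos:
        with mlflow.start_run():
            model = xgb.XGBClassifier(n_estimators=n_estimators,
                                      subsample=subsample)
            model.fit(X_tr, y_tr)
            acc = accuracy_score(y_te, model.predict(X_te))
            mlflow.log_params({"n_estimators": n_estimators,
                               "subsample": subsample})
            mlflow.log_metric("accuracy", acc)
            results.append(((n_estimators, subsample), acc))
    return iter(results)

def run_grid_search(sc):
    combos = make_param_grid()
    # 3 partitions -> 3 parallel tasks, each training 3 models
    results = (sc.parallelize(combos, 3)
                 .mapPartitions(train_partition)
                 .collect())
    # Return the best (params, score) pair
    return max(results, key=lambda r: r[1])
```

I'm not sure whether logging to MLflow from worker tasks like this is the recommended pattern on Databricks (e.g., whether the tracking URI and run context propagate correctly to the executors), which is part of what I'm asking.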