topic Re: Parallelization in training machine learning models using MLFlow in Machine Learning

Parallelization in training machine learning models using MLFlow

ianchenmu — Thu, 13 Oct 2022 14:55:19 GMT

I'm training a ML model (e.g., XGboost) and I have a large combination of 5 hyperparameters, say each parameter has 5 candidates, it will be 5^5 = 3,125 combos.

Now I want to do parallelization for the grid search on all the hyperparameter combos for training a machine learning model to get the best performance of the model.

So how can I achieve this on Databricks, especially using MLFlow? I've been told I can define a function to train and evaluate the model (using mlflow) and defining an array with all of the hyper-parameter combinations, sc.parallelize the array and then mapping the function over.

I have come up with the code for the sc.parallelize the array, like

paras_combo_test =  [(x, y) for x in [50, 100, 150] for y in [0.8,0.9,0.95]]
sc.parallelize(paras_combo_test, 3).glom().collect()

(for simplicit, I'm just using two parameters x, y and there are 9 combos in total and I divided them to 3 partitions.)

How can I map over the function which does the model training with evaluation (probably using mlflow), so that there will be 3 works (each work will train 3 models) in parallel from the partitions of parameter combos I have?

Re: Parallelization in training machine learning models using MLFlow

Anonymous — Thu, 13 Oct 2022 15:15:53 GMT

This blog should be very helpful:

https://www.databricks.com/blog/2021/04/15/how-not-to-tune-your-model-with-hyperopt.html

Here are the docs on xgboost

https://docs.databricks.com/machine-learning/train-model/xgboost.html

A simple rule is never use sc.parallelize.

Re: Parallelization in training machine learning models using MLFlow

ianchenmu — Thu, 13 Oct 2022 17:00:41 GMT

Thanks @Joseph Kambourakis ! It seems we could do the distributed XGBoost training using the num_workers regards to how many workers in the cluster. But can we also speed up by setting a parameter utilizing the number of cores in the cluster?

Re: Parallelization in training machine learning models using MLFlow

Anonymous — Thu, 13 Oct 2022 17:58:54 GMT

You can set num_workers to your default parallelism

https://databricks.github.io/spark-deep-learning/_modules/sparkdl/xgboost/xgboost.html

Re: Parallelization in training machine learning models using MLFlow

Hubert-Dudek — Tue, 18 Oct 2022 13:35:16 GMT

collect() is working on the driver and will not offer any parallelism but rather OOM error.

Re: Parallelization in training machine learning models using MLFlow

Anonymous — Mon, 21 Nov 2022 07:45:45 GMT

Hi @Chen Mu

Hope all is well!

Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help.

We'd love to hear from you.

Thanks!