How can I use Databricks to "automagically" distribute scikit-learn model training?

Joseph_B — Thu, 24 Jun 2021 20:29:49 GMT

Is there a way to automatically distribute training and model tuning across a Spark cluster, if I want to keep using scikit-learn?

Re: How can I use Databricks to "automagically" distribute scikit-learn model training?

Joseph_B — Thu, 24 Jun 2021 20:42:11 GMT

It depends on what you mean by "automagically."

If you want to keep using scikit-learn, there are ways to distribute parts of training and tuning with minimal effort. However, there is no "magic" way to distribute the training of an individual scikit-learn model. Scikit-learn is fundamentally a single-machine ML library, so training one model (e.g., a decision tree) in a distributed way requires a different implementation, such as the one in Apache Spark MLlib.

You can distribute some parts of the workflow easily:

Model tuning and cross validation
Data prep and featurization
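
To make the first item concrete, here is a minimal sketch of why tuning is easy to distribute. It is not taken from the Databricks docs; the dataset, model, and parameter grid are illustrative assumptions. Every candidate setting is a separate, self-contained single-machine fit, so the candidates can run on different workers without any change to scikit-learn itself.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative grid (an assumption); each entry is one independent tuning task.
grid = ParameterGrid({"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]})


def evaluate(params):
    """Train and score one candidate; shares no state with other candidates."""
    model = DecisionTreeClassifier(random_state=0, **params)
    return cross_val_score(model, X, y, cv=3).mean()


# Run serially here. Because the tasks are independent, this loop is the part
# that a cluster-aware scheduler can spread across workers.
scores = [(evaluate(p), p) for p in grid]
best_score, best_params = max(scores, key=lambda pair: pair[0])
print(best_params, round(best_score, 3))
```

Each call to evaluate() still trains its tree on a single machine; only the outer loop over candidates is a candidate for distribution.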

Good tools for distributing these workloads with scikit-learn include:

Hyperopt with SparkTrials: Hyperopt is a Python library for adaptive (smart & efficient) hyperparameter tuning, and there is a SparkTrials component which lets you scale tuning across a Spark cluster. See the Databricks docs (AWS, Azure, GCP) and the Hyperopt SparkTrials docs for more info.
joblib-spark: Some algorithms in scikit-learn (especially the tuning and cross-validation tools) let you specify a parallel backend through joblib. You can use the joblib-spark backend to make Spark that parallel backend. See the joblib-spark GitHub page for an example.
Koalas: This provides a pandas API backed by Spark. Great for data prep. See the Koalas website for more info, and note that the Spark community plans to include this in future Spark releases.
Pandas UDFs in Spark DataFrames: These let you specify arbitrary code (such as scikit-learn featurization logic) in operations on distributed DataFrames. See these docs for more info (AWS, Azure, GCP).
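
As a rough illustration of the joblib-spark option, here is a minimal sketch. The function names tune_svc and tune_svc_on_spark, the iris dataset, and the parameter grid are my own assumptions for the example. The first function runs a grid search on whatever joblib backend you name, and defaults to the local "loky" backend so it works without a cluster. The second registers the Spark backend following the pattern in the joblib-spark README; it needs the joblibspark package and a running Spark session.

```python
from joblib import parallel_backend
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def tune_svc(backend="loky", n_jobs=2):
    """Grid-search an SVC on iris, running the CV fits on the given joblib backend."""
    X, y = load_iris(return_X_y=True)
    param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
    search = GridSearchCV(SVC(), param_grid, cv=3)
    # Each (parameter combination, fold) fit is an independent task that the
    # backend schedules; every individual fit is still single-machine.
    with parallel_backend(backend, n_jobs=n_jobs):
        search.fit(X, y)
    return search


def tune_svc_on_spark(n_jobs=4):
    """Same search with Spark as the backend (requires joblibspark and a cluster)."""
    from joblibspark import register_spark  # pip install joblibspark

    register_spark()  # makes the "spark" backend available to joblib
    return tune_svc(backend="spark", n_jobs=n_jobs)


if __name__ == "__main__":
    result = tune_svc()  # local run; call tune_svc_on_spark() on a cluster
    print(result.best_params_, round(result.best_score_, 3))
```

Only the backend name changes between the local and the Spark run, which is why this route takes so little effort if your code already uses scikit-learn's tuning tools.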
