It depends on what you mean by "automagically."
If you want to keep using scikit-learn, there are ways to distribute parts of training and tuning with minimal effort. However, there is no "magic" way to distribute training of an individual model in scikit-learn; it is fundamentally a single-machine ML library, so training a model (e.g., a decision tree) in a distributed way requires a different implementation, such as Apache Spark MLlib.
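For a sense of what that different implementation looks like, here is a rough sketch of training a decision tree with Spark MLlib instead of scikit-learn (the input path and column names are placeholders, not from any real dataset):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.getOrCreate()

# A Spark DataFrame with numeric feature columns and a "label" column
# (the path and column names are placeholders).
df = spark.read.parquet("/path/to/training_data")

# MLlib expects the features packed into a single vector column.
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
tree = DecisionTreeClassifier(featuresCol="features", labelCol="label")

# The tree itself is trained in a distributed fashion across the cluster.
model = tree.fit(assembler.transform(df))
```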
You can distribute some parts of the workflow easily:
- Model tuning and cross-validation
- Data prep and featurization
Good tools for distributing these workloads with scikit-learn include the following (a minimal sketch of each appears after the list):
- Hyperopt with SparkTrials: Hyperopt is a Python library for adaptive (smart, efficient) hyperparameter tuning, and its SparkTrials class lets you scale that tuning across a Spark cluster. See the Databricks docs (AWS, Azure, GCP) and the Hyperopt SparkTrials docs for more info.
- joblib-spark: Many scikit-learn utilities (notably the tuning and cross-validation tools) accept a joblib parallel backend, and joblib-spark provides a backend that runs those jobs on a Spark cluster. See the joblib-spark GitHub page for an example.
- Koalas: This provides a pandas DataFrame API backed by Spark, which makes it a good fit for data prep. See the Koalas website for more info, and note that the Spark community plans to include this in future Spark releases.
- Pandas UDFs in Spark DataFrames: These let you run arbitrary Python code (such as scikit-learn featurization logic) in operations on distributed DataFrames. See these docs for more info (AWS, Azure, GCP).
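For Hyperopt with SparkTrials, the shape of the code is roughly the following; the search space, estimator, and dataset are placeholders, while fmin, SparkTrials, and hp are real Hyperopt APIs:

```python
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # small stand-in dataset

def objective(params):
    # Each trial trains an ordinary single-machine scikit-learn model;
    # only the trials themselves run in parallel as Spark tasks.
    clf = SVC(C=params["C"], gamma=params["gamma"])
    accuracy = cross_val_score(clf, X, y, cv=3).mean()
    return {"loss": -accuracy, "status": STATUS_OK}

search_space = {
    "C": hp.loguniform("C", -3, 3),
    "gamma": hp.loguniform("gamma", -6, 1),
}

best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=64,
    trials=SparkTrials(parallelism=8),
)
print(best)
```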
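For joblib-spark, the pattern (adapted from the project's README; the estimator and parameter grid here are placeholders) is to register the Spark backend and wrap the scikit-learn call in parallel_backend:

```python
from sklearn.utils import parallel_backend
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from joblibspark import register_spark

register_spark()  # register the Spark backend with joblib

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# The grid search's individual fits are farmed out as Spark tasks;
# each fit is still ordinary single-machine scikit-learn training.
with parallel_backend("spark", n_jobs=8):
    search = GridSearchCV(SVC(), param_grid, cv=3)
    search.fit(X, y)

print(search.best_params_)
```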
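For Koalas, data prep looks like pandas but runs on Spark; the file path and column names below are placeholders:

```python
import databricks.koalas as ks

# Distributed data prep with a pandas-like API.
kdf = ks.read_parquet("/path/to/raw_data")
kdf = kdf.dropna(subset=["price", "bedrooms"])
kdf["price_per_bedroom"] = kdf["price"] / kdf["bedrooms"]

# Once the data is prepped (and small enough to fit on one machine),
# hand it to scikit-learn on the driver as a plain pandas DataFrame.
pdf = kdf.to_pandas()
```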
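For pandas UDFs, one common pattern is to fit a scikit-learn transformer on a driver-side sample and then apply it to a distributed DataFrame batch by batch. This sketch assumes a Spark 3.x session, an existing Spark DataFrame df, and a numeric "price" column; all of those are placeholders:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from sklearn.preprocessing import StandardScaler

# Fit the transformer on a sample pulled to the driver (df and the
# "price" column are placeholders).
sample = df.select("price").limit(10000).toPandas()
scaler = StandardScaler().fit(sample[["price"]].to_numpy())

@pandas_udf("double")
def scale_price(price: pd.Series) -> pd.Series:
    # Runs on the workers, one pandas batch at a time; the fitted
    # scaler is shipped to the workers along with the UDF.
    return pd.Series(scaler.transform(price.to_numpy().reshape(-1, 1)).ravel())

df = df.withColumn("price_scaled", scale_price("price"))
```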