Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
How can I use Databricks to "automagically" distribute scikit-learn model training?

Joseph_B
New Contributor III

Is there a way to automatically distribute training and model tuning across a Spark cluster, if I want to keep using scikit-learn?

1 REPLY

Joseph_B
New Contributor III

It depends on what you mean by "automagically."

If you want to keep using scikit-learn, there are ways to distribute parts of training and tuning with minimal effort. However, there is no "magic" way to distribute the training of an individual model in scikit-learn; it is fundamentally a single-machine ML library, so training a model (e.g., a decision tree) in a distributed way requires a different implementation, such as the one in Apache Spark MLlib.

You can distribute some parts of the workflow easily:

  • Model tuning and cross-validation
  • Data prep and featurization

Good tools for distributing these workloads with scikit-learn include:

  • Hyperopt with SparkTrials: Hyperopt is a Python library for adaptive (smart, efficient) hyperparameter tuning, and its SparkTrials component lets you scale tuning across a Spark cluster. See the Databricks docs (AWS, Azure, GCP) and the Hyperopt SparkTrials docs for more info.
  • joblib-spark: Some scikit-learn tools (especially the tuning and cross-validation utilities) let you specify a parallel backend, and joblib-spark provides a backend that uses Spark for that parallelism. See the joblib-spark GitHub page for an example.
  • Koalas: This provides a pandas API backed by Spark, which is great for data prep. See the Koalas website for more info; Koalas has since been merged into Spark itself as the pandas API on Spark (pyspark.pandas) as of Spark 3.2.
  • Pandas UDFs on Spark DataFrames: These let you run arbitrary Python code (such as scikit-learn featurization logic) in operations on distributed DataFrames. See these docs for more info (AWS, Azure, GCP).
