Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
How can I use Databricks to "automagically" distribute scikit-learn model training?

Joseph_B
New Contributor III

Is there a way to automatically distribute training and model tuning across a Spark cluster, if I want to keep using scikit-learn?

1 REPLY

Joseph_B
New Contributor III

It depends on what you mean by "automagically."

If you want to keep using scikit-learn, there are ways to distribute parts of training and tuning with minimal effort. However, there is no "magic" way to distribute the training of an individual model in scikit-learn; it is fundamentally a single-machine ML library, so training a model (e.g., a decision tree) in a distributed way requires a different implementation, such as the one in Apache Spark MLlib.

You can distribute some parts of the workflow easily:

  • Model tuning and cross-validation
  • Data prep and featurization

Good tools for distributing these workloads with scikit-learn include:

  • Hyperopt with SparkTrials: Hyperopt is a Python library for adaptive (smart, efficient) hyperparameter tuning, and its SparkTrials component lets you scale tuning across a Spark cluster. See the Databricks docs (AWS, Azure, GCP) and the Hyperopt SparkTrials docs for more info.
  • joblib-spark: Some scikit-learn tools (especially the tuning and cross-validation utilities) let you specify a parallel backend, and joblib-spark provides a backend that uses Spark for that parallelism. See the joblib-spark GitHub page for an example.
  • Koalas: This provides a pandas API backed by Spark, which is great for data prep. See the Koalas website for more info; Koalas has since been merged into Spark itself as the pandas API on Spark (pyspark.pandas) as of Spark 3.2.
  • Pandas UDFs on Spark DataFrames: These let you run arbitrary Python code (such as scikit-learn featurization logic) in operations on distributed DataFrames. See these docs for more info (AWS, Azure, GCP).
