<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How can I use Databricks to "automagically" distribute scikit-learn model training? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-can-i-use-databricks-to-quot-automagically-quot-distribute/m-p/20366#M13737</link>
    <description>&lt;P&gt;Is there a way to automatically distribute training and model tuning across a Spark cluster, if I want to keep using scikit-learn?&lt;/P&gt;</description>
    <pubDate>Thu, 24 Jun 2021 20:29:49 GMT</pubDate>
    <dc:creator>Joseph_B</dc:creator>
    <dc:date>2021-06-24T20:29:49Z</dc:date>
    <item>
      <title>How can I use Databricks to "automagically" distribute scikit-learn model training?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-can-i-use-databricks-to-quot-automagically-quot-distribute/m-p/20366#M13737</link>
      <description>&lt;P&gt;Is there a way to automatically distribute training and model tuning across a Spark cluster, if I want to keep using scikit-learn?&lt;/P&gt;</description>
      <pubDate>Thu, 24 Jun 2021 20:29:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-can-i-use-databricks-to-quot-automagically-quot-distribute/m-p/20366#M13737</guid>
      <dc:creator>Joseph_B</dc:creator>
      <dc:date>2021-06-24T20:29:49Z</dc:date>
    </item>
    <item>
      <title>Re: How can I use Databricks to "automagically" distribute scikit-learn model training?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-can-i-use-databricks-to-quot-automagically-quot-distribute/m-p/20367#M13738</link>
      <description>&lt;P&gt;It depends on what you mean by "automagically."&lt;/P&gt;&lt;P&gt;If you want to keep using scikit-learn, there are ways to distribute parts of training and tuning with minimal effort.  However, there is no "magic" way to distribute the training of an individual scikit-learn model.  scikit-learn is fundamentally a single-machine ML library, so training one model (e.g., a decision tree) in a distributed way requires a different implementation, such as the one in Apache Spark MLlib.&lt;/P&gt;&lt;P&gt;&lt;B&gt;You can distribute some parts of the workflow easily&lt;/B&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Model tuning and cross validation&lt;/LI&gt;&lt;LI&gt;Data prep and featurization&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Good tools for distributing these workloads with scikit-learn include&lt;/B&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Hyperopt with SparkTrials: Hyperopt is a Python library for adaptive (smart &amp;amp; efficient) hyperparameter tuning, and its SparkTrials component lets you scale tuning across a Spark cluster.  
See the Databricks docs (&lt;A href="https://docs.databricks.com/applications/machine-learning/automl-hyperparam-tuning/index.html#hyperparameter-tuning-with-hyperopt" alt="https://docs.databricks.com/applications/machine-learning/automl-hyperparam-tuning/index.html#hyperparameter-tuning-with-hyperopt" target="_blank"&gt;AWS&lt;/A&gt;, &lt;A href="https://docs.microsoft.com/en-us/azure/databricks/applications/machine-learning/automl-hyperparam-tuning/#--hyperparameter-tuning-with-hyperopt" alt="https://docs.microsoft.com/en-us/azure/databricks/applications/machine-learning/automl-hyperparam-tuning/#--hyperparameter-tuning-with-hyperopt" target="_blank"&gt;Azure&lt;/A&gt;, &lt;A href="https://docs.gcp.databricks.com/applications/machine-learning/automl-hyperparam-tuning/index.html#hyperparameter-tuning-with-hyperopt" alt="https://docs.gcp.databricks.com/applications/machine-learning/automl-hyperparam-tuning/index.html#hyperparameter-tuning-with-hyperopt" target="_blank"&gt;GCP&lt;/A&gt;) and the &lt;A href="http://hyperopt.github.io/hyperopt/scaleout/spark/" alt="http://hyperopt.github.io/hyperopt/scaleout/spark/" target="_blank"&gt;Hyperopt SparkTrials docs&lt;/A&gt; for more info.&lt;/LI&gt;&lt;LI&gt;joblib-spark: Some algorithms in scikit-learn (especially the tuning and cross-validation tools) let you specify a parallel backend.  The joblib-spark backend lets Spark serve as that parallel backend.  See the &lt;A href="https://github.com/joblib/joblib-spark" alt="https://github.com/joblib/joblib-spark" target="_blank"&gt;joblib-spark GitHub page&lt;/A&gt; for an example.&lt;/LI&gt;&lt;LI&gt;Koalas: This provides a pandas API backed by Spark.  Great for data prep.  
See the &lt;A href="https://koalas.readthedocs.io/en/latest/" alt="https://koalas.readthedocs.io/en/latest/" target="_blank"&gt;Koalas website&lt;/A&gt; for more info, and know that the Spark community plans to include this in future Spark releases.&lt;/LI&gt;&lt;LI&gt;Pandas UDFs in Spark DataFrames: These let you specify arbitrary code (such as scikit-learn featurization logic) in operations on distributed DataFrames.  See these docs for more info (&lt;A href="https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html" alt="https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html" target="_blank"&gt;AWS&lt;/A&gt;, &lt;A href="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/udf-python-pandas" alt="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/udf-python-pandas" target="_blank"&gt;Azure&lt;/A&gt;, &lt;A href="https://docs.gcp.databricks.com/spark/latest/spark-sql/udf-python-pandas.html" alt="https://docs.gcp.databricks.com/spark/latest/spark-sql/udf-python-pandas.html" target="_blank"&gt;GCP&lt;/A&gt;).&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 24 Jun 2021 20:42:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-can-i-use-databricks-to-quot-automagically-quot-distribute/m-p/20367#M13738</guid>
      <dc:creator>Joseph_B</dc:creator>
      <dc:date>2021-06-24T20:42:11Z</dc:date>
    </item>
  </channel>
</rss>

