<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How do I distribute machine learning process in my spark data frame in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/how-do-i-distribute-machine-learning-process-in-my-spark-data/m-p/57507#M2866</link>
    <description>&lt;P&gt;I have tried using pyspark.ml, and I used the link you sent me to mimic the process. The data I'm using is pretty large and takes over 30 mins to run. I have not written any code to update the nodes. I want to learn how to use more than two nodes to increase the performance so I can add more features.&lt;/P&gt;</description>
    <pubDate>Tue, 16 Jan 2024 19:32:13 GMT</pubDate>
    <dc:creator>mohaimen_syed</dc:creator>
    <dc:date>2024-01-16T19:32:13Z</dc:date>
    <item>
      <title>How do I distribute machine learning process in my spark data frame</title>
      <link>https://community.databricks.com/t5/machine-learning/how-do-i-distribute-machine-learning-process-in-my-spark-data/m-p/55524#M2801</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm trying to use around 5 numerical features on 3.5 million rows to train and test my model with a Spark DataFrame. My cluster has 60 nodes available but is only using 2. How can I distribute the process or make it more efficient and faster?&lt;/P&gt;&lt;P&gt;My code:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;# Imports needed by the snippet below&lt;/DIV&gt;
&lt;DIV&gt;from pyspark.ml import Pipeline&lt;/DIV&gt;
&lt;DIV&gt;from pyspark.ml.classification import RandomForestClassifier&lt;/DIV&gt;
&lt;DIV&gt;from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator&lt;/DIV&gt;
&lt;DIV&gt;from pyspark.ml.feature import VectorAssembler&lt;/DIV&gt;
&lt;DIV&gt;from pyspark.ml.tuning import CrossValidator, ParamGridBuilder&lt;/DIV&gt;&lt;BR /&gt;
&lt;DIV&gt;vector_assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")&lt;/DIV&gt;&lt;BR /&gt;
&lt;DIV&gt;# Random Forest Classifier&lt;/DIV&gt;
&lt;DIV&gt;rf = RandomForestClassifier(featuresCol="features", labelCol="target", numTrees=100)&lt;/DIV&gt;&lt;BR /&gt;
&lt;DIV&gt;# Pipeline&lt;/DIV&gt;
&lt;DIV&gt;pipeline = Pipeline(stages=[vector_assembler, rf])&lt;/DIV&gt;&lt;BR /&gt;
&lt;DIV&gt;# Hyperparameter tuning using Cross-Validation&lt;/DIV&gt;
&lt;DIV&gt;param_grid = ParamGridBuilder().build()&lt;/DIV&gt;
&lt;DIV&gt;evaluator = BinaryClassificationEvaluator(labelCol="target", rawPredictionCol="rawPrediction", metricName="areaUnderROC")&lt;/DIV&gt;
&lt;DIV&gt;cross_validator = CrossValidator(estimator=pipeline, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)&lt;/DIV&gt;&lt;BR /&gt;
&lt;DIV&gt;# Train the model&lt;/DIV&gt;
&lt;DIV&gt;cv_model = cross_validator.fit(df)&lt;/DIV&gt;&lt;BR /&gt;
&lt;DIV&gt;# Make predictions&lt;/DIV&gt;
&lt;DIV&gt;predictions = cv_model.transform(df)&lt;/DIV&gt;&lt;BR /&gt;
&lt;DIV&gt;# Evaluate the model&lt;/DIV&gt;
&lt;DIV&gt;evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="accuracy")&lt;/DIV&gt;
&lt;DIV&gt;accuracy = evaluator.evaluate(predictions)&lt;/DIV&gt;
&lt;DIV&gt;print(f"Accuracy: {accuracy}")&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 19 Dec 2023 19:39:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-do-i-distribute-machine-learning-process-in-my-spark-data/m-p/55524#M2801</guid>
      <dc:creator>mohaimen_syed</dc:creator>
      <dc:date>2023-12-19T19:39:39Z</dc:date>
    </item>
    <item>
      <title>Re: How do I distribute machine learning process in my spark data frame</title>
      <link>https://community.databricks.com/t5/machine-learning/how-do-i-distribute-machine-learning-process-in-my-spark-data/m-p/56994#M2839</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/91852"&gt;@mohaimen_syed&lt;/a&gt;&amp;nbsp;- Can you please try using the pyspark.ml implementation of RandomForestClassifier instead of sklearn and see if it works? Here is an example: &lt;A href="https://github.com/apache/spark/blob/master/examples/src/main/python/ml/random_forest_classifier_example.py" target="_blank"&gt;https://github.com/apache/spark/blob/master/examples/src/main/python/ml/random_forest_classifier_example.py&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Thanks, Shan&lt;/P&gt;</description>
      <pubDate>Thu, 11 Jan 2024 21:02:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-do-i-distribute-machine-learning-process-in-my-spark-data/m-p/56994#M2839</guid>
      <dc:creator>shan_chandra</dc:creator>
      <dc:date>2024-01-11T21:02:16Z</dc:date>
    </item>
    <item>
      <title>Re: How do I distribute machine learning process in my spark data frame</title>
      <link>https://community.databricks.com/t5/machine-learning/how-do-i-distribute-machine-learning-process-in-my-spark-data/m-p/56998#M2840</link>
      <description>&lt;P&gt;Thank you for your reply,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/616"&gt;@shan_chandra&lt;/a&gt;. I looked at this code and tried doing the same thing. The cluster uses 2 nodes at most, even though there are 60 available. I believe the advantage of using Databricks is its distributed compute, but I'm not sure how to use it effectively.&lt;/P&gt;</description>
      <pubDate>Thu, 11 Jan 2024 21:55:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-do-i-distribute-machine-learning-process-in-my-spark-data/m-p/56998#M2840</guid>
      <dc:creator>mohaimen_syed</dc:creator>
      <dc:date>2024-01-11T21:55:10Z</dc:date>
    </item>
    <item>
      <title>Re: How do I distribute machine learning process in my spark data frame</title>
      <link>https://community.databricks.com/t5/machine-learning/how-do-i-distribute-machine-learning-process-in-my-spark-data/m-p/57483#M2858</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/91852"&gt;@mohaimen_syed&lt;/a&gt;&amp;nbsp;- There are several possible reasons why at most 2 nodes are used.&lt;/P&gt;
&lt;P&gt;1. The sklearn implementation of the random forest classifier is not distributed. Please use the pyspark.ml implementation.&lt;/P&gt;
&lt;P&gt;2. Your DataFrame may be small enough that it does not need more than 2 nodes.&lt;/P&gt;
&lt;P&gt;Always start with a small number of nodes and adjust the number of nodes based on your workload.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2024 15:43:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-do-i-distribute-machine-learning-process-in-my-spark-data/m-p/57483#M2858</guid>
      <dc:creator>shan_chandra</dc:creator>
      <dc:date>2024-01-16T15:43:27Z</dc:date>
    </item>
    <item>
      <title>Re: How do I distribute machine learning process in my spark data frame</title>
      <link>https://community.databricks.com/t5/machine-learning/how-do-i-distribute-machine-learning-process-in-my-spark-data/m-p/57507#M2866</link>
      <description>&lt;P&gt;I have tried using pyspark.ml, and I used the link you sent me to mimic the process. The data I'm using is pretty large and takes over 30 mins to run. I have not written any code to update the nodes. I want to learn how to use more than two nodes to increase the performance so I can add more features.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2024 19:32:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-do-i-distribute-machine-learning-process-in-my-spark-data/m-p/57507#M2866</guid>
      <dc:creator>mohaimen_syed</dc:creator>
      <dc:date>2024-01-16T19:32:13Z</dc:date>
    </item>
  </channel>
</rss>

