<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Cluster setup for ML work for Pandas in Spark, and vanilla Python. in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/cluster-setup-for-ml-work-for-pandas-in-spark-and-vanilla-python/m-p/31195#M1663</link>
    <description>&lt;P&gt;@Vivek Ranjan&amp;nbsp;- Does Joseph's answer help? If it does, would you be happy to mark it as best? If it doesn't, please tell us so we can help you.&lt;/P&gt;</description>
    <pubDate>Mon, 07 Mar 2022 17:33:38 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2022-03-07T17:33:38Z</dc:date>
    <item>
      <title>Cluster setup for ML work for Pandas in Spark, and vanilla Python.</title>
      <link>https://community.databricks.com/t5/machine-learning/cluster-setup-for-ml-work-for-pandas-in-spark-and-vanilla-python/m-p/31192#M1660</link>
      <description>&lt;P&gt;My setup:&lt;/P&gt;&lt;P&gt;Worker type: Standard_D32d_v4, 128 GB Memory, 32 Cores, Min Workers: 2, Max Workers: 8&lt;/P&gt;&lt;P&gt;Driver type: Standard_D32ds_v4, 128 GB Memory, 32 Cores&lt;/P&gt;&lt;P&gt;Databricks Runtime Version: 10.2 ML (includes Apache Spark 3.2.0, Scala 2.12)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I ran a Snowflake query and pulled in two datasets of 30 million rows and 7 columns. I saved them as pyspark.pandas.frame.DataFrame objects; call them df1 and df2.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The first column of each dataset is a household_id. I want to check how many household_id values from df1 are not in df2.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried two different ways:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;len(set(df1['household_id'].to_list()).difference(df2['household_id'].to_list()))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df1['household_id'].isin(df2['household_id'].to_list()).value_counts()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Both of the above fail with out-of-memory errors.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My questions are:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Where does the Python list computation in the first code snippet happen? Is it on the driver node or a worker node? I believe that code runs on a single node and is not distributed?&lt;/LI&gt;&lt;LI&gt;Is there a better way to debug out-of-memory issues, such as identifying which piece of code failed and on which node?&lt;/LI&gt;&lt;LI&gt;What is the best guidance on creating a cluster? This could depend on understanding how pieces of code will run, such as distributed across worker nodes or on a single driver node. Is there general guidance on whether the driver node should be beefier (more memory and cores) than the worker nodes, or vice versa?&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Fri, 21 Jan 2022 17:16:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/cluster-setup-for-ml-work-for-pandas-in-spark-and-vanilla-python/m-p/31192#M1660</guid>
      <dc:creator>Vik1</dc:creator>
      <dc:date>2022-01-21T17:16:42Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster setup for ML work for Pandas in Spark, and vanilla Python.</title>
      <link>https://community.databricks.com/t5/machine-learning/cluster-setup-for-ml-work-for-pandas-in-spark-and-vanilla-python/m-p/31193#M1661</link>
      <description>&lt;P&gt;Hi again! Thanks for this question also and for your patience. We'll be back after we give the members of the community a chance to respond. &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 21 Jan 2022 19:52:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/cluster-setup-for-ml-work-for-pandas-in-spark-and-vanilla-python/m-p/31193#M1661</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-01-21T19:52:57Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster setup for ML work for Pandas in Spark, and vanilla Python.</title>
      <link>https://community.databricks.com/t5/machine-learning/cluster-setup-for-ml-work-for-pandas-in-spark-and-vanilla-python/m-p/31194#M1662</link>
      <description>&lt;P&gt;Plain Python code runs on the driver. Distributed Spark code runs on the workers. Calling to_list() collects the whole column into a Python list in the driver's memory, so the list work in both snippets happens on the driver.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here are some cluster tips:&lt;/P&gt;&lt;P&gt;If you're doing ML, use an ML runtime.&lt;/P&gt;&lt;P&gt;If you're not doing distributed work, use a single-node cluster.&lt;/P&gt;&lt;P&gt;Don't use autoscaling for ML.&lt;/P&gt;&lt;P&gt;For deep learning, use GPUs.&lt;/P&gt;&lt;P&gt;Try to size the cluster for the data size.&lt;/P&gt;</description>
      <pubDate>Fri, 21 Jan 2022 20:00:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/cluster-setup-for-ml-work-for-pandas-in-spark-and-vanilla-python/m-p/31194#M1662</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-01-21T20:00:55Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster setup for ML work for Pandas in Spark, and vanilla Python.</title>
      <link>https://community.databricks.com/t5/machine-learning/cluster-setup-for-ml-work-for-pandas-in-spark-and-vanilla-python/m-p/31195#M1663</link>
      <description>&lt;P&gt;@Vivek Ranjan&amp;nbsp;- Does Joseph's answer help? If it does, would you be happy to mark it as best? If it doesn't, please tell us so we can help you.&lt;/P&gt;</description>
      <pubDate>Mon, 07 Mar 2022 17:33:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/cluster-setup-for-ml-work-for-pandas-in-spark-and-vanilla-python/m-p/31195#M1663</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-03-07T17:33:38Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster setup for ML work for Pandas in Spark, and vanilla Python.</title>
      <link>https://community.databricks.com/t5/machine-learning/cluster-setup-for-ml-work-for-pandas-in-spark-and-vanilla-python/m-p/31196#M1664</link>
      <description>&lt;P&gt;Hey there @Vivek Ranjan&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Checking in. If Joseph's answer helped, would you let us know and mark the answer as best? It would really help other members find the solution more quickly.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 22 Apr 2022 14:23:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/cluster-setup-for-ml-work-for-pandas-in-spark-and-vanilla-python/m-p/31196#M1664</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-04-22T14:23:05Z</dc:date>
    </item>
  </channel>
</rss>

