
Cluster setup for ML work for Pandas in Spark, and vanilla Python.

Vik1
New Contributor II

My setup:

Worker type: Standard_D32d_v4, 128 GB Memory, 32 Cores, Min Workers: 2, Max Workers: 8

Driver type: Standard_D32ds_v4, 128 GB Memory, 32 Cores

Databricks Runtime Version: 10.2 ML (includes Apache Spark 3.2.0, Scala 2.12)

I ran a Snowflake query and pulled in two datasets of 30 million rows and 7 columns. I saved them as pyspark.pandas.frame.DataFrame objects; call them df1 and df2.

The first column of each dataset is household_id. I want to check how many household_id values from df1 are not in df2.

I tried two different ways:

len(set(df1['household_id'].to_list()).difference(df2['household_id'].to_list()))

df1['household_id'].isin(df2['household_id'].to_list()).value_counts()

Both of the above fail with an out-of-memory error.

My questions are:

  1. Where does the Python list computation in the first code snippet happen: on the driver node or on a worker node? I believe that code runs on a single node and is not distributed?
  2. Is there a better way to debug an out-of-memory issue, such as finding which piece of code failed and on which node?
  3. What is the best guidance on creating a cluster? This could depend on understanding how pieces of code will run, such as distributed across worker nodes or on the single driver node. Is there general guidance on whether the driver node should be beefier (more memory and cores) than the worker nodes, or vice versa?

Accepted Solution

Anonymous
Not applicable

Plain Python code, such as building a list with to_list() or a set, runs on the driver. Distributed Spark code runs on the workers.
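To illustrate the difference for the household_id check in the question, here is a minimal sketch that keeps the comparison in Spark instead of pulling the ids into Python lists on the driver. It assumes df1 and df2 are pyspark.pandas DataFrames, as described in the question; the function name count_ids_missing_from is hypothetical and not part of any library.

```python
def count_ids_missing_from(df1, df2, key="household_id"):
    """Count distinct `key` values that appear in df1 but not in df2.

    Assumption: df1 and df2 are pyspark.pandas DataFrames.
    The function name is made up for this sketch.
    """
    # Keep only the key column, convert to plain Spark DataFrames,
    # and de-duplicate so each id is counted once.
    left = df1[[key]].to_spark().dropDuplicates()
    right = df2[[key]].to_spark().dropDuplicates()

    # A left anti join keeps the rows of `left` that have no match in `right`.
    missing = left.join(right, on=key, how="left_anti")

    # count() sends back a single integer rather than the ids themselves.
    return missing.count()
```

Called as count_ids_missing_from(df1, df2), this counts distinct ids, which matches the set difference in the first snippet rather than the per-row counts of the isin version. A null household_id in df1 never matches in a join, so it would be counted as missing.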

Here are some cluster tips:

If you're doing ML, then use an ML runtime.

If you're not doing distributed stuff, use a single node cluster.

Don't use autoscaling for ML.

For deep learning, use GPUs.

Try to size the cluster for the data size.


4 REPLIES 4

Anonymous
Not applicable

Hi again! Thanks for this question also and for your patience. We'll be back after we give the members of the community a chance to respond. 🙂

Anonymous
Not applicable

@Vivek Ranjan - Does Joseph's answer help? If it does, would you be happy to mark it as best? If it doesn't, please tell us so we can help you.

Anonymous
Not applicable

Hey there @Vivek Ranjan

Checking in. If Joseph's answer helped, would you let us know and mark the answer as best? It would be really helpful for the other members to find the solution more quickly.

Thanks!
