Cluster setup for ML work for Pandas in Spark, and vanilla Python.

Vik1
New Contributor II

My setup:

Worker type: Standard_D32d_v4, 128 GB Memory, 32 Cores, Min Workers: 2, Max Workers: 8

Driver type: Standard_D32ds_v4, 128 GB Memory, 32 Cores

Databricks Runtime Version: 10.2 ML (includes Apache Spark 3.2.0, Scala 2.12)

I ran a snowflake query and pulled in two datasets 30 million rows and 7 columns. Saved them as pyspark.pandas.frame.DataFrame, call them df1, and df2 (the two dataframes)

1st column of each of these datasets is a household_id. I want to check how many household_id from df1 is not in df2.

I tried two different ways:

len(set(df1['household_id'].to_list).difference(df2['household_id'].to_list()))

df1['household_id'].isin(df2['household_id'].to_list()).value_counts()

The above two fail because of out of memory issue.

My questions are:

  1. Where is the python list computation happening as in first code snippet? Is it on driver node or worker node? I believe that code is being run in a single node and not distributed?
  2. Is there a way to better debug out of memory issue? Such as which piece of code? Which node the code failed., etc.
  3. What is the best guidance on creating a cluster? This could depend on understanding how pieces of code will run such as distributed across worker nodes, or running on a single driver . node. Is there a general guidance if driver node should be beefier (larger memory and cores) as compared to worker nodes or vice-versa?