My setup:
Worker type: Standard_D32d_v4, 128 GB Memory, 32 Cores, Min Workers: 2, Max Workers: 8
Driver type: Standard_D32ds_v4, 128 GB Memory, 32 Cores
Databricks Runtime Version: 10.2 ML (includes Apache Spark 3.2.0, Scala 2.12)
I ran a Snowflake query and pulled in two datasets, each with 30 million rows and 7 columns, and saved them as pyspark.pandas.frame.DataFrame objects; call them df1 and df2.
The first column of each dataset is household_id. I want to check how many household_id values from df1 are not in df2.
I tried two different ways:
len(set(df1['household_id'].to_list()).difference(df2['household_id'].to_list()))
df1['household_id'].isin(df2['household_id'].to_list()).value_counts()
Both of the above fail with an out-of-memory error.
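
For reference, here is the kind of fully distributed version I think I should be writing instead. This is only a sketch: it assumes df1 and df2 are the pyspark.pandas DataFrames described above, and that a left anti join on the underlying Spark DataFrames is a valid replacement for the set difference:

# Work on the underlying Spark DataFrames so the comparison stays on the workers
sdf1 = df1[['household_id']].to_spark()
sdf2 = df2[['household_id']].to_spark()

# A left anti join keeps the rows of sdf1 whose household_id has no match in sdf2
missing = sdf1.join(sdf2, on='household_id', how='left_anti')

# Number of distinct household_ids present in df1 but not in df2
print(missing.select('household_id').distinct().count())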
My questions are:
- Where is the Python list computation in the first code snippet happening: on the driver node or on a worker node? I believe that code runs on a single node rather than being distributed (see the sketch after this list for what I mean).
- Is there a way to better debug out-of-memory issues, e.g., which piece of code failed and on which node?
- What is the best guidance for creating a cluster? This presumably depends on understanding how each piece of code runs: distributed across the worker nodes, or on the single driver node. Is there general guidance on whether the driver node should be beefier (more memory and cores) than the worker nodes, or vice versa?
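
To make my first question concrete, here is the distinction I am asking about (these comments reflect my assumptions, not verified behavior):

# I believe this pulls every household_id value out of the distributed
# DataFrame into a plain Python list held by a single Python process,
# presumably on the driver node:
ids_as_python_list = df1['household_id'].to_list()

# ...whereas an aggregation expressed on the pyspark.pandas object itself,
# like counting distinct household_ids, should be planned by Spark and
# executed across the worker nodes:
n_distinct = df1['household_id'].nunique()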