01-21-2022 09:16 AM
My setup:
Worker type: Standard_D32d_v4, 128 GB Memory, 32 Cores, Min Workers: 2, Max Workers: 8
Driver type: Standard_D32ds_v4, 128 GB Memory, 32 Cores
Databricks Runtime Version: 10.2 ML (includes Apache Spark 3.2.0, Scala 2.12)
I ran a Snowflake query and pulled in two datasets of 30 million rows and 7 columns each. I saved them as pyspark.pandas.frame.DataFrame objects; call them df1 and df2.
The first column of each dataset is a household_id. I want to check how many household_id values from df1 are not in df2.
I tried two different ways:
len(set(df1['household_id'].to_list()).difference(df2['household_id'].to_list()))
df1['household_id'].isin(df2['household_id'].to_list()).value_counts()
Both of the above fail with an out-of-memory error.
My questions are:
01-21-2022 12:00 PM
Python code runs on the driver. Distributed/Spark code runs on the workers.
Here are some cluster tips:
If you're doing ML, then use an ML runtime.
If you're not doing distributed stuff, use a single node cluster.
Don't use autoscaling for ML.
For deep learning, use GPUs.
Try to size the cluster for the data size.
01-21-2022 11:52 AM
Hi again! Thanks for this question as well, and for your patience. We'll be back after we give the members of the community a chance to respond.
03-07-2022 09:33 AM
@Vivek Ranjan - Does Joseph's answer help? If it does, would you be happy to mark it as best? If it doesn't, please tell us so we can help you.
04-22-2022 07:23 AM
Hey there @Vivek Ranjan
Checking in. If Joseph's answer helped, would you let us know and mark it as best? That would help other members find the solution more quickly.
Thanks!