I'm trying to drop duplicates in a DataFrame with about 500B records, deduplicating based on multiple columns, but this step takes around 5 hours. I've tried a lot of the suggestions available on the internet, but nothing works for me.
My code looks like this:
from pyspark.sql.functions import broadcast

df_1 = spark.read.format("delta").table("t1")              # ~60M rows, 200 partitions
df_2 = spark.read.format("delta").table("t2")              # ~8M rows, 160 partitions
df_join = df_1.join(broadcast(df_2), "city_code", "left")  # ~500B rows, 300 partitions
Up to this point the job takes only about 1 minute to process the data, but as soon as I add the line below it takes 5 hours:
df_clean = df_join.dropDuplicates(["col1", "col2", "col3"])
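A note on how the timings show up: Spark evaluates lazily, so the 5 hours only appears on whatever action runs after dropDuplicates. Below is a minimal sketch, assuming a plain count() as that action and using explain() to inspect the plan; both are stand-ins, not my exact job.

# Stand-in action, only for illustration: the whole plan actually runs here,
# so this is where the 5h shows up in the Spark UI.
df_clean.explain()       # physical plan; typically shows the Exchange (shuffle) introduced by dropDuplicates
print(df_clean.count())  # action that triggers the job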