drop duplicate in 500B records

ImAbhishekTomar · ‎08-28-2024

I’m trying to drop duplicate in a DF where I have 500B records I’m trying to delete based on multiple columns but this process it’s takes 5h, I try lot of things that available on internet but nothing is works for me.

my code is like this.

df_1=spark.read.format(delta).table(t1) - 60M -200 partition
df_2=spark.read.format(delta).table(t2) - 8M - 160 partition

df_join=df_1.join(broadcast(df_2),city_code,left) - 500B - 300 partition

till here my job is only taking 1mins to process this data but when I add below line it’s takes 5hours

df_clean=df_join.dropDuplicate([col1,col2,col3])

Brahmareddy · ‎08-28-2024

Hi @ImAbhishekTomar, How are you doing today?

To speed up your job,Give a try repartitioning the DataFrame by the columns you're dropping duplicates on before running dropDuplicates. You could also checkpoint the DataFrame to simplify its lineage. If that doesn't help, consider using a group by method instead of dropDuplicates or optimizing your Delta tables with Z-ordering. Lastly, make sure your cluster has enough resources to handle the load.

Give a try and let me know if it works.

Good day.

Regards,

Brahma

filipniziol · ‎08-29-2024

Drop the duplicates from the df_1 and df_2 first and then do the join.
If the join is just a city code, then most likely you know which rows in df_2 and in df_1 will give you the duplicates in df_join. So drop in df_1 and drop in df_2 instead of df_join.