Drop duplicates in 500B records
08-28-2024 01:42 PM
I’m trying to drop duplicates in a DataFrame with 500B records, deduplicating on multiple columns, but the process takes 5 hours. I’ve tried a lot of the suggestions available on the internet, but nothing works for me.
My code looks like this:
from pyspark.sql.functions import broadcast

df_1 = spark.read.format("delta").table("t1")               # ~60M rows, 200 partitions
df_2 = spark.read.format("delta").table("t2")               # ~8M rows, 160 partitions
df_join = df_1.join(broadcast(df_2), "city_code", "left")   # ~500B rows, 300 partitions

Up to this point the job only takes about 1 minute, but when I add the line below it takes 5 hours:

df_clean = df_join.dropDuplicates(["col1", "col2", "col3"])
08-28-2024 07:45 PM
Hi @ImAbhishekTomar, how are you doing today?
To speed up your job, try repartitioning the DataFrame by the columns you're deduplicating on before calling dropDuplicates. You could also checkpoint the DataFrame to truncate its lineage. If that doesn't help, consider a groupBy instead of dropDuplicates, or optimize your Delta tables with Z-ordering. Lastly, make sure your cluster has enough resources to handle the load.
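For example, here is a rough sketch of the repartition-then-dropDuplicates idea and the groupBy alternative, building on your df_join and the col1/col2/col3 keys from your snippet (adjust to your real columns):

from pyspark.sql import functions as F

dedup_cols = ["col1", "col2", "col3"]

# Option 1: repartition on the dedup keys so identical rows are co-located,
# then drop duplicates within each partition.
df_clean = df_join.repartition(*dedup_cols).dropDuplicates(dedup_cols)

# Option 2: groupBy on the keys and keep one value per remaining column,
# as an alternative to dropDuplicates.
other_cols = [c for c in df_join.columns if c not in dedup_cols]
df_clean = (
    df_join
    .groupBy(*dedup_cols)
    .agg(*[F.first(c, ignorenulls=True).alias(c) for c in other_cols])
)

Note that option 2, like dropDuplicates, keeps an arbitrary surviving row per key combination.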
Give it a try and let me know if it works.
Good day.
Regards,
Brahma
08-29-2024 01:48 PM
Drop the duplicates from df_1 and df_2 first, and then do the join.
If the join key is just city_code, then you most likely already know which rows in df_1 and df_2 will produce the duplicates in df_join. So drop the duplicates in df_1 and in df_2 instead of in df_join.
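A minimal sketch of that approach, assuming the dedup keys col1/col2/col3 live in t1 and that duplicate city_code values in t2 are what inflate the join result; adjust the column lists to where the keys actually live in your tables:

from pyspark.sql.functions import broadcast

# Dedupe the small inputs (60M and 8M rows) instead of the 500B-row join result.
df_1_clean = df_1.dropDuplicates(["col1", "col2", "col3"])
df_2_clean = df_2.dropDuplicates(["city_code"])   # one row per city_code

df_clean = df_1_clean.join(broadcast(df_2_clean), "city_code", "left")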

