I'm trying to drop duplicates in a DataFrame with about 500B records, deduplicating based on multiple columns, but this step takes around 5 hours. I've tried a lot of the suggestions available on the internet, but nothing works for me.
My code looks like this:
from pyspark.sql.functions import broadcast

df_1 = spark.read.format("delta").table("t1")              # ~60M rows, 200 partitions
df_2 = spark.read.format("delta").table("t2")              # ~8M rows, 160 partitions
df_join = df_1.join(broadcast(df_2), "city_code", "left")  # ~500B rows, 300 partitions
Up to this point the job takes only about 1 minute to process the data, but as soon as I add the line below it takes 5 hours:
df_clean = df_join.dropDuplicates(["col1", "col2", "col3"])
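A note on how the timings show up: Spark evaluates lazily, so the 5 hours only appears on whatever action runs after dropDuplicates. Below is a minimal sketch, assuming a plain count() as that action and using explain() to inspect the plan; both are stand-ins, not my exact job.

# Stand-in action, only for illustration: the whole plan actually runs here,
# so this is where the 5h shows up in the Spark UI.
df_clean.explain()       # physical plan; typically shows the Exchange (shuffle) introduced by dropDuplicates
print(df_clean.count())  # action that triggers the job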