cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

drop duplicate in 500B records

ImAbhishekTomar
New Contributor III

I’m trying to drop duplicate in a DF where I have 500B records I’m trying to delete  based on multiple columns but this process it’s takes 5h, I try lot of things that available on internet but nothing is works for me.

my code is like this.

df_1=spark.read.format(delta).table(t1) - 60M -200 partition
df_2=spark.read.format(delta).table(t2) - 8M - 160 partition

df_join=df_1.join(broadcast(df_2),city_code,left) - 500B - 300 partition

till here my job is only taking 1mins to process this data but when I add below line it’s takes 5hours

df_clean=df_join.dropDuplicate([col1,col2,col3])

 

 

2 REPLIES 2

Brahmareddy
Valued Contributor II

Hi @ImAbhishekTomar, How are you doing today?

To speed up your job,Give a try repartitioning the DataFrame by the columns you're dropping duplicates on before running dropDuplicates. You could also checkpoint the DataFrame to simplify its lineage. If that doesn't help, consider using a group by method instead of dropDuplicates or optimizing your Delta tables with Z-ordering. Lastly, make sure your cluster has enough resources to handle the load.

Give a try and let me know if it works.

Good day.

Regards,

Brahma

filipniziol
New Contributor III

Drop the duplicates from the df_1 and df_2 first and then do the join.
If the join is just a city code, then most likely you know which rows in df_2 and in df_1 will give you the duplicates in df_join. So drop in df_1 and drop in df_2 instead of df_join.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group