cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Remove duplicate records using pyspark

sanjay
Valued Contributor II

Hi,

I am trying to remove duplicate records from pyspark dataframe and keep the latest one. But somehow df.dropDuplicates["id"] keeps the first one instead of latest. One of the option is to use pandas drop_duplicates, Is there any solution in pyspark.

Thanks,

Sanjay

1 ACCEPTED SOLUTION

Accepted Solutions

szymon_dybczak
Esteemed Contributor III

Hi @sanjay ,

You can write window function that will rank your rows and then filter rows based on that rank.

Take a look on below stackoverflow thread: 

https://stackoverflow.com/questions/63343958/how-to-drop-duplicates-but-keep-first-in-pyspark-datafr...

 

View solution in original post

1 REPLY 1

szymon_dybczak
Esteemed Contributor III

Hi @sanjay ,

You can write window function that will rank your rows and then filter rows based on that rank.

Take a look on below stackoverflow thread: 

https://stackoverflow.com/questions/63343958/how-to-drop-duplicates-but-keep-first-in-pyspark-datafr...

 

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local communityโ€”sign up today to get started!

Sign Up Now