
Remove duplicate records using pyspark

sanjay
Valued Contributor II

Hi,

I am trying to remove duplicate records from a PySpark DataFrame and keep the latest one. But somehow df.dropDuplicates(["id"]) keeps the first one instead of the latest. One option is to use pandas drop_duplicates, but is there a solution in PySpark?

Thanks,

Sanjay

1 ACCEPTED SOLUTION


szymon_dybczak
Contributor

Hi @sanjay ,

You can write a window function that ranks the rows within each id and then filter the rows based on that rank.

Take a look at the Stack Overflow thread below:

https://stackoverflow.com/questions/63343958/how-to-drop-duplicates-but-keep-first-in-pyspark-datafr...

 

