08-28-2024 10:15 AM
Hi,
I am trying to remove duplicate records from a PySpark DataFrame and keep the latest one, but df.dropDuplicates(["id"]) keeps the first occurrence instead of the latest. One option is to use pandas drop_duplicates; is there a solution in PySpark?
Thanks,
Sanjay
Accepted Solutions
08-28-2024 11:36 AM
Hi @sanjay ,
You can use a window function that ranks the rows within each id and then filter on that rank. Take a look at the Stack Overflow thread below:
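Here is a minimal sketch of that approach. It assumes a timestamp column named updated_at (a hypothetical name; substitute whatever column defines "latest" in your data):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Toy data: id 1 appears twice with different timestamps
df = spark.createDataFrame(
    [(1, "2024-08-01", "old"), (1, "2024-08-15", "new"), (2, "2024-08-10", "only")],
    ["id", "updated_at", "value"],
)

# Number the rows within each id, newest first, then keep row 1
w = Window.partitionBy("id").orderBy(col("updated_at").desc())
latest = (
    df.withColumn("rn", row_number().over(w))
      .filter(col("rn") == 1)
      .drop("rn")
)

latest.show()
```

Note that row_number() keeps exactly one row per id even when two rows share the same timestamp; use rank() instead if you want to keep all tied rows.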