Remove duplicate records using pyspark

sanjay — Wed, 28 Aug 2024 17:15:01 GMT

Hi,

I am trying to remove duplicate records from a PySpark DataFrame and keep the latest one. But somehow df.dropDuplicates(["id"]) keeps the first one instead of the latest. One option is to use pandas drop_duplicates. Is there a solution in PySpark?

Thanks,

Sanjay

Re: Remove duplicate records using pyspark

szymon_dybczak — Wed, 28 Aug 2024 18:36:19 GMT

Hi @sanjay ,

You can write a window function that ranks the rows within each id, and then filter the rows based on that rank.

Take a look at the Stack Overflow thread below:

https://stackoverflow.com/questions/63343958/how-to-drop-duplicates-but-keep-first-in-pyspark-dataframe