Databricks Community

sanjay · ‎08-28-2024

Hi,

I am trying to remove duplicate records from a PySpark DataFrame and keep the latest one. But somehow df.dropDuplicates(["id"]) keeps the first one instead of the latest. One option is to use pandas drop_duplicates, but is there a solution in PySpark?

Thanks,

Sanjay

szymon_dybczak · ‎08-28-2024

Hi @sanjay ,

You can write a window function that ranks the rows within each id, and then filter the rows based on that rank.

Take a look at the Stack Overflow thread below:

https://stackoverflow.com/questions/63343958/how-to-drop-duplicates-but-keep-first-in-pyspark-datafr...

Remove duplicate records using pyspark
