Databricks Community

sanjay · 08-28-2024

Hi,

I am trying to remove duplicate records from a PySpark DataFrame and keep the latest one. But somehow df.dropDuplicates(["id"]) keeps the first one instead of the latest. One option is to use pandas drop_duplicates. Is there a solution in PySpark?

Thanks,

Sanjay

szymon_dybczak · 08-28-2024

Hi @sanjay ,

You can write a window function that ranks your rows within each id, and then filter the rows based on that rank.

Take a look at the Stack Overflow thread below:

https://stackoverflow.com/questions/63343958/how-to-drop-duplicates-but-keep-first-in-pyspark-datafr...


Remove duplicate records using pyspark
