Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

pyspark dropDuplicates performance issue

sanjay
Valued Contributor II

Hi,

I am trying to delete duplicate records found by key, but it is very slow. It is a continuously running pipeline, so the data volume is not that large, but this command still takes a long time to execute.

df = df.dropDuplicates(["fileName"])

Is there a better approach to removing duplicate data from a PySpark DataFrame?

Regards,

Sanjay

1 REPLY

sanjay
Valued Contributor II

Thank you @Retired_mod. Since I am removing duplicates on a single column only, I am specifying that column name in dropDuplicates, but it is still very slow. Can you provide more context on the last point, i.e.:

  • Streamlining Your Data with Grouping and Aggregation: To easily condense your dataset by a single column's values, utilize the power of aggregation functions. 

Is there any way to tune dropDuplicates?

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won't want to miss the chance to attend and share knowledge.

If there isn't a group near you, start one and help create a community that brings people together.

Request a New Group