cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

pyspark dropDuplicates performance issue

sanjay
Valued Contributor II

Hi,

I am trying to delete duplicate records found by key but its very slow.  Its continuous running pipeline so data is not that huge but still it takes time to execute this command.

df = df.dropDuplicates(["fileName"])

Is there any better approach to delete duplicate data from pyspark dataframe.

Regards,

Sanjay

1 ACCEPTED SOLUTION

Accepted Solutions

Kaniz
Community Manager
Community Manager
  • Hi @sanjay, A smarter alternative to using dropDuplicates() would be to leverage the groupBy() function and apply aggregation functions. By doing so, you can effectively reduce your dataset by a given column's values. For instance, if you only need a single instance of every value in the "fileName" column, you can group by that column and aggregate the remaining columns (if necessary).

 

  • Also, By using window functions, you can perform calculations on a group of rows that are connected to the current row. This feature proves to be extremely valuable when it comes to removing duplicate data. In order to retain the initial occurrence of each value in the "fileName" column, one can utilize the row_number() function within a window specification.

 

  • Guarantee that the data types of columns are suitable. By casting columns to more efficient types, it can enhance performance. For instance, if the value stored in "fileName" is currently a string, you can convert it to the appropriate data type for strings.

 

  • To boost the efficiency of your queries, consider dividing and arranging your DataFrame by the "fileName" column if it's extensive. You can easily achieve this with repartition() and sortWithinPartitions().

.

View solution in original post

3 REPLIES 3

Kaniz
Community Manager
Community Manager

Hi @sanjay, When it comes to handling duplicate data in a PySpark DataFrame, there are more effective techniques available instead of relying on dropDuplicates(). 

 

Let's dive into some superior alternatives: 

 

  • Utilizing dropDuplicates() with Column Subset: The dropDuplicates() function eradicates redundant rows from a DataFrame. You have the option to specify a subset of columns to take into account for identifying duplicates. This will ensure that only one instance of each unique value in the specified column(s) is kept.
  • To ensure uniqueness in our DataFrame, we can use the distinct() transformation. This handy function removes any duplicate rows, taking into account all columns. Let's see it in action with an example using Python: df = df.distinct()
  • If you encounter issues with duplicate comparisons, they might be caused by data type conflicts. To avoid this, make sure that the column being used for deduplication is in the correct format, such as string. 
  • Streamlining Your Data with Grouping and Aggregation: To easily condense your dataset by a single column's values, utilize the power of aggregation functions. 

sanjay
Valued Contributor II

Thank you @Kaniz. As I am trying to remove duplicate only on single column, so am specifying column name in dropDuplicates. Still its very slow. Can you provide more context on last point i.e. 

  • Streamlining Your Data with Grouping and Aggregation: To easily condense your dataset by a single column's values, utilize the power of aggregation functions. 

Is there any possibility to tune dropDuplicate

Kaniz
Community Manager
Community Manager
  • Hi @sanjay, A smarter alternative to using dropDuplicates() would be to leverage the groupBy() function and apply aggregation functions. By doing so, you can effectively reduce your dataset by a given column's values. For instance, if you only need a single instance of every value in the "fileName" column, you can group by that column and aggregate the remaining columns (if necessary).

 

  • Also, By using window functions, you can perform calculations on a group of rows that are connected to the current row. This feature proves to be extremely valuable when it comes to removing duplicate data. In order to retain the initial occurrence of each value in the "fileName" column, one can utilize the row_number() function within a window specification.

 

  • Guarantee that the data types of columns are suitable. By casting columns to more efficient types, it can enhance performance. For instance, if the value stored in "fileName" is currently a string, you can convert it to the appropriate data type for strings.

 

  • To boost the efficiency of your queries, consider dividing and arranging your DataFrame by the "fileName" column if it's extensive. You can easily achieve this with repartition() and sortWithinPartitions().

.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.