Hi,
I am trying to delete duplicate records, matched by a key column, but it is very slow. It's a continuously running pipeline, so the data volume is not that huge, but this command still takes a long time to execute:
df = df.dropDuplicates(["fileName"])
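For context, here is a minimal sketch of the kind of pipeline I mean. It assumes a Structured Streaming job reading JSON files; the source path, schema, app name, and console sink below are simplified placeholders, and only the dropDuplicates call is the actual code from my pipeline:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("dedupe-example").getOrCreate()

# Placeholder schema and source; the real pipeline reads from elsewhere.
schema = StructType([StructField("fileName", StringType(), True)])
df = spark.readStream.schema(schema).format("json").load("/data/incoming")

# Keep only the first row seen for each fileName.
df = df.dropDuplicates(["fileName"])

# Placeholder sink so the sketch runs end to end.
query = df.writeStream.format("console").outputMode("append").start()
query.awaitTermination()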
Is there a better approach to removing duplicate data from a PySpark DataFrame?
Regards,
Sanjay