Databricks Community

sanjay · ‎02-11-2024

Hi,

I am trying to delete duplicate records found by key, but it's very slow. It's a continuously running pipeline, so the data is not that huge, but this command still takes a long time to execute.

df = df.dropDuplicates(["fileName"])

Is there a better approach to deleting duplicate data from a PySpark DataFrame?

Regards,

Sanjay

sanjay · ‎02-12-2024

Thank you @Retired_mod. I am removing duplicates on a single column only, so I specify that column name in dropDuplicates. It's still very slow. Can you provide more context on the last point, i.e.:

Streamlining Your Data with Grouping and Aggregation: To easily condense your dataset by a single column's values, utilize the power of aggregation functions.

Is there any way to tune dropDuplicates?

NandiniN · ‎01-31-2025

Before calling dropDuplicates, ensure that your DataFrame operations are optimized by caching intermediate results if they are reused multiple times. This can help reduce the overall execution time.

We could also use grouping and aggregation, like:

from pyspark.sql.functions import first
df_deduped = df.groupBy("fileName").agg(*[first(c).alias(c) for c in df.columns if c != "fileName"])

pyspark dropDuplicates performance issue
