Hi,
I am trying to delete duplicate records, matched by a key column, but it is very slow. It's a continuously running pipeline, so the data volume is not that huge, but this command still takes a long time to execute:
df = df.dropDuplicates(["fileName"])
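For context, here is a minimal sketch of the kind of pipeline I mean. It assumes a Structured Streaming job reading JSON files; the source path, schema, app name, and console sink below are simplified placeholders, and only the dropDuplicates call is the actual code from my pipeline:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("dedupe-example").getOrCreate()

# Placeholder schema and source; the real pipeline reads from elsewhere.
schema = StructType([StructField("fileName", StringType(), True)])
df = spark.readStream.schema(schema).format("json").load("/data/incoming")

# Keep only the first row seen for each fileName.
df = df.dropDuplicates(["fileName"])

# Placeholder sink so the sketch runs end to end.
query = df.writeStream.format("console").outputMode("append").start()
query.awaitTermination()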
Is there a better approach to removing duplicate data from a PySpark DataFrame?
Regards,
Sanjay