Before calling dropDuplicates, make sure your DataFrame pipeline is optimized: cache intermediate results (for example with df.cache()) if they are reused by more than one action. This avoids recomputing the same lineage each time and can reduce the overall execution time.
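Alternatively, you could deduplicate with a grouping and an aggregate, keeping the first value of every other column for each fileName. This assumes the DataFrame has columns besides fileName, and without an explicit ordering the row that first picks is not deterministic:
from pyspark.sql.functions import first
df_deduped = df.groupBy("fileName").agg(*[first(c).alias(c) for c in df.columns if c != "fileName"])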