What is the difference between coalesce and repart...

aladda · ‎06-19-2021

Coalesce essentially merges multiple existing partitions into fewer, larger partitions. Use coalesce when you want to reduce the number of partitions (and therefore tasks) without a full shuffle and without impacting sort order. Example: writing out a single CSV file instead of multiple part files.
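To make the "merging whole partitions" idea concrete, here is a toy model in plain Python. It is only a sketch and not Spark's actual implementation: partitions are modeled as lists of rows, and the `coalesce` function and its "merge neighbouring partitions" rule are assumptions made for illustration.

```python
from typing import List


def coalesce(partitions: List[List[int]], n: int) -> List[List[int]]:
    """Toy model: merge whole neighbouring partitions into at most n groups.

    Rows are never split out of their original partition, so no shuffle
    is needed and the relative order of rows is preserved.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if n >= len(partitions):
        # Asking for more partitions than exist changes nothing.
        return [list(p) for p in partitions]
    merged: List[List[int]] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Map input partition i to one of the n output partitions.
        target = i * n // len(partitions)
        merged[target].extend(part)
    return merged


parts = [[1, 2], [3], [4, 5, 6], [7]]
print(coalesce(parts, 2))  # [[1, 2, 3], [4, 5, 6, 7]]
print(coalesce(parts, 1))  # [[1, 2, 3, 4, 5, 6, 7]]
```

In Spark itself the corresponding call is `df.coalesce(n)`. As in the toy model, requesting more partitions than currently exist leaves the partition count unchanged.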

Use repartition when you want to trigger a shuffle that changes the number of partitions. A common use case for repartition is to remove skew in file sizes, or to start out with a smaller or otherwise different number of partitions than the Spark default.
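The skew-removal effect of a shuffle can be sketched with a toy model in plain Python. This is not Spark's code: partitions are modeled as lists of rows, and the `repartition` function below is a hypothetical helper that deals every row out round-robin, which is roughly how a column-less repartition spreads rows evenly.

```python
from typing import List


def repartition(partitions: List[List[int]], n: int) -> List[List[int]]:
    """Toy model: redistribute every row round-robin across n partitions.

    Each row may move to a different partition (the "shuffle"), so the
    output sizes are even regardless of how skewed the input was.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    out: List[List[int]] = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for row in part:
            out[i % n].append(row)
            i += 1
    return out


skewed = [[1, 2, 3, 4, 5, 6, 7], [8], []]
balanced = repartition(skewed, 4)
print(balanced)                    # [[1, 5], [2, 6], [3, 7], [4, 8]]
print([len(p) for p in balanced])  # [2, 2, 2, 2]
```

In Spark itself the corresponding call is `df.repartition(n)`. Because every row can move, it can increase as well as decrease the partition count, at the cost of a full shuffle.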

What is the difference between coalesce and repartition when it comes to shuffle partitions in Spark?