Data Engineering

What is the difference between coalesce and repartition when it comes to shuffle partitions in Spark?

Anand_Ladda
Honored Contributor II
 
Accepted Solutions

Anand_Ladda
Honored Contributor II

Coalesce essentially merges existing partitions into a smaller number of larger partitions, without a full shuffle. So use coalesce when you want to reduce the number of partitions (and therefore tasks) without impacting sort order. For example, use it when you want to write out a single CSV file instead of multiple part files.

Use repartition when you want to force a shuffle that changes the number of partitions, either up or down. A common use case for repartition is to even out skew in file sizes, or to start out with a smaller or otherwise different number of partitions than the default in Spark.
