Data Engineering
How to configure Spark to adjust the number of output partitions after a join or groupBy?

Rinat
New Contributor

I know you can set "spark.sql.shuffle.partitions" and "spark.sql.adaptive.advisoryPartitionSizeInBytes". The former is not respected once adaptive query execution is enabled, and the latter for some reason only applies to the first shuffle; after that, Spark just uses the default number of partitions, i.e. the number of cores, because there are no partitions left to coalesce.
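For reference, this is how those two settings are applied, shown as a minimal PySpark sketch (the values are placeholders, not recommendations):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-partition-tuning")
    # Starting shuffle partition count; with AQE enabled this is only
    # the initial value that coalescing may shrink.
    .config("spark.sql.shuffle.partitions", "2000")
    # Target partition size AQE advises when coalescing after a shuffle.
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "100m")
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)
```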

Is there a way to configure AQE to adjust the number of partitions such that each partition is no more than 100 MB? I only see that this can be done for skewed partitions, not for all Exchange operations.
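For context, these are the AQE coalescing knobs adjacent to this behavior (a sketch, not a confirmed fix; in particular, whether setting parallelismFirst to false makes AQE honor the advisory size on every shuffle in this job is exactly what the question is asking):

```python
# AQE partition-coalescing settings (Spark 3.2+ names).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# With parallelismFirst=true (the default), AQE ignores the advisory size
# and maximizes parallelism; false makes it respect the advisory target.
spark.conf.set("spark.sql.adaptive.coalescePartitions.parallelismFirst", "false")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "100m")
# Upper bound on partitions before coalescing; falls back to
# spark.sql.shuffle.partitions when unset.
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "4000")
```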

For now, we have had to turn off AQE for a big job because it resulted in over 100 TB of spill. Alternatively, we could manually repartition each DataFrame, as sketched below, but that is not very convenient and is error-prone.
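The manual workaround looks roughly like this (a hypothetical helper; repartition_to_size and the caller-supplied size estimate are assumptions, and needing that estimate is part of what makes the approach error-prone):

```python
from pyspark.sql import DataFrame

def repartition_to_size(df: DataFrame, estimated_bytes: int,
                        target_bytes: int = 100 * 1024 * 1024) -> DataFrame:
    """Repartition df so each output partition is roughly target_bytes.

    estimated_bytes must come from the caller (e.g. summed input file
    sizes), since Spark exposes no cheap, exact size for an arbitrary
    DataFrame.
    """
    # Ceiling division so the last partition never exceeds the target.
    num_partitions = max(1, (estimated_bytes + target_bytes - 1) // target_bytes)
    return df.repartition(num_partitions)
```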

Thank you.
