
How to configure Spark to adjust the number of output partitions after a join or groupBy?

Rinat
New Contributor

I know you can set "spark.sql.shuffle.partitions" and "spark.sql.adaptive.advisoryPartitionSizeInBytes". The former is effectively ignored once adaptive query execution is enabled, and the latter only seems to apply to the first shuffle; after that, the job falls back to the default number of partitions (i.e. the number of cores), because there are no partitions left to coalesce.
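
For context, here is roughly how I am setting these today (PySpark sketch; the keys are the standard Spark 3.x AQE ones, and the ~100MB target is just our goal, not something Spark enforces for me):

# Sketch of the configuration I have tried (Spark 3.x / Databricks Runtime)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")
# In our jobs this seems to be ignored once AQE is enabled
spark.conf.set("spark.sql.shuffle.partitions", "2000")
# Target partition size; in my runs this only seems to affect the first shuffle
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "100MB")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")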

Is there a way to configure AQE to adjust the number of partitions so that each partition is no more than 100 MB? I only see that this can be done for skewed partitions, not for every Exchange operation.
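
The skew-specific knobs I was referring to are roughly these (values shown are the defaults as I understand them; they only rebalance skewed join partitions, not every Exchange):

# Skew handling only - this does not cap the size of ordinary shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")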

For now, we have had to turn off AQE for a big job because it resulted in 100+ TB of spill. Alternatively, we could manually repartition each DataFrame, but that is inconvenient and error-prone.
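
The manual workaround looks something like this; the helper, the size estimate, and the 100MB target are all our own heuristics, not anything built into Spark:

import math

TARGET_PARTITION_BYTES = 100 * 1024 * 1024  # ~100MB per output partition

def repartition_by_size(df, estimated_bytes):
    # estimated_bytes is our own rough guess of the DataFrame's size after the join
    num_partitions = max(1, math.ceil(estimated_bytes / TARGET_PARTITION_BYTES))
    return df.repartition(num_partitions)

# fact_df and dim_df are placeholders for our real inputs; ~2TB is a rough post-join estimate
joined = repartition_by_size(fact_df.join(dim_df, "key"), estimated_bytes=2 * 1024**4)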

Thank you.

0 REPLIES
