Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to configure Spark to adjust the number of output partitions after a join or groupBy?

Rinat
New Contributor

I know you can set "spark.sql.shuffle.partitions" and "spark.sql.adaptive.advisoryPartitionSizeInBytes". The former does not work once adaptive query execution is enabled, and the latter only seems to apply to the first shuffle; after that, Spark just uses the default number of partitions (i.e. the number of cores), because there are no partitions left to coalesce.
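For context, this is roughly how we set these today (a minimal sketch; the app name and partition count are illustrative, and on Databricks the session already exists):

```python
from pyspark.sql import SparkSession

# Minimal sketch of our current settings; values are illustrative.
# On Databricks, `spark` already exists, so spark.conf.set(...) works equally well.
spark = (
    SparkSession.builder
    .appName("shuffle-partitioning-example")  # hypothetical app name
    # Static shuffle partition count (the setting AQE appears to override)
    .config("spark.sql.shuffle.partitions", "2000")
    # Target partition size AQE aims for when coalescing shuffle partitions
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "100MB")
    .getOrCreate()
)
```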

Is there a way to configure AQE to adjust the number of partitions such that each partition is no more than 100MB? I only see that it can be done for skewed partitions, but not for all Exchange operations.
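For the skewed-partition case, these are the settings I have come across so far (a sketch; the thresholds are illustrative and may behave differently across Spark/DBR versions):

```python
# Skew handling is the only per-partition size control I have found so far.
# Thresholds are illustrative; continues with the `spark` session from above.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# A partition is considered skewed if it is this many times larger than the
# median partition size...
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
# ...and also larger than this absolute threshold.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "100MB")
```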

For now, we have had to turn off AQE for a big job because it was producing 100+ TB of spill. Alternatively, we can manually repartition the DataFrame each time, but that is inconvenient and error-prone.
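The manual workaround looks roughly like this (a sketch with hypothetical table names and a made-up size estimate, just to show the pattern):

```python
# Hypothetical tables and size estimate, purely to illustrate the pattern.
TARGET_PARTITION_BYTES = 100 * 1024 * 1024  # aim for <= 100 MB per partition

orders = spark.table("orders")          # hypothetical input table
customers = spark.table("customers")    # hypothetical input table

joined = orders.join(customers, "customer_id")

# Rough estimate of the join output size; in practice we derive this from
# input sizes or metrics of previous runs.
estimated_output_bytes = 5 * 1024 ** 4  # e.g. ~5 TB
num_partitions = max(1, estimated_output_bytes // TARGET_PARTITION_BYTES)

joined = joined.repartition(num_partitions, "customer_id")
```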

Thank you.

0 REPLIES
