Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to configure Spark to adjust the number of output partitions after a join or groupBy?

Rinat
New Contributor

I know you can set "spark.sql.shuffle.partitions" and "spark.sql.adaptive.advisoryPartitionSizeInBytes". The former does not work once adaptive query execution is enabled, and the latter only seems to apply to the first shuffle; after that, Spark just uses the default number of partitions (i.e. the number of cores), because there are no partitions left to coalesce.
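For context, this is roughly how we set these today (a minimal sketch; the app name and partition count are illustrative, and on Databricks the session already exists):

```python
from pyspark.sql import SparkSession

# Minimal sketch of our current settings; values are illustrative.
# On Databricks, `spark` already exists, so spark.conf.set(...) works equally well.
spark = (
    SparkSession.builder
    .appName("shuffle-partitioning-example")  # hypothetical app name
    # Static shuffle partition count (the setting AQE appears to override)
    .config("spark.sql.shuffle.partitions", "2000")
    # Target partition size AQE aims for when coalescing shuffle partitions
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "100MB")
    .getOrCreate()
)
```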

Is there a way to configure AQE to adjust the number of partitions such that each partition is no more than 100MB? I only see that it can be done for skewed partitions, but not for all Exchange operations.
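For the skewed-partition case, these are the settings I have come across so far (a sketch; the thresholds are illustrative and may behave differently across Spark/DBR versions):

```python
# Skew handling is the only per-partition size control I have found so far.
# Thresholds are illustrative; continues with the `spark` session from above.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# A partition is considered skewed if it is this many times larger than the
# median partition size...
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
# ...and also larger than this absolute threshold.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "100MB")
```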

For now, we have had to turn off AQE for a big job because it was producing 100+ TB of spill. Alternatively, we can manually repartition the DataFrame each time, but that is inconvenient and error-prone.
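The manual workaround looks roughly like this (a sketch with hypothetical table names and a made-up size estimate, just to show the pattern):

```python
# Hypothetical tables and size estimate, purely to illustrate the pattern.
TARGET_PARTITION_BYTES = 100 * 1024 * 1024  # aim for <= 100 MB per partition

orders = spark.table("orders")          # hypothetical input table
customers = spark.table("customers")    # hypothetical input table

joined = orders.join(customers, "customer_id")

# Rough estimate of the join output size; in practice we derive this from
# input sizes or metrics of previous runs.
estimated_output_bytes = 5 * 1024 ** 4  # e.g. ~5 TB
num_partitions = max(1, estimated_output_bytes // TARGET_PARTITION_BYTES)

joined = joined.repartition(num_partitions, "customer_id")
```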

Thank you.

0 REPLIES
