Understanding Coalesce, Skewed Joins, and Why AQE Doesn't Always Intervene

techgeorge — Tue, 15 Apr 2025 21:15:46 GMT

In Spark, data skew can be the silent killer of performance: a single oversized partition that pulls in 90% of the data leaves one task doing most of the work.

But even with AQE (Adaptive Query Execution) turned on in Databricks, skew isn't always detected automatically, and here’s why.

What Is coalesce() in Spark?

The coalesce(n) function reduces the number of partitions in a DataFrame without a full shuffle. It is typically used to compact data after a wide transformation such as a join or groupBy, and it’s especially useful when:

You're writing output to disk (e.g., Parquet, Delta) and want fewer files.
You're post-processing skewed data and hope that fewer partitions will spread the load more evenly.

But coalesce() only merges existing partitions; it does not rebalance rows between them. A disproportionately large volume of data can therefore remain concentrated in a single partition, leading to severe data skew, where one task handles the majority of the workload while the others sit underutilized.
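
To see why merging partitions preserves skew, here is a toy model in plain Python. It is not Spark's actual coalescing logic (which also weighs data locality); the two helper functions and the row counts are made up for illustration. It only contrasts "merge neighbouring partitions" with "reshuffle every row".

```python
from typing import List


def coalesce_sizes(sizes: List[int], n: int) -> List[int]:
    """Toy model of coalesce(n): merge neighbouring partitions into n groups.

    Rows never move between groups, so a large input partition stays large.
    """
    if n >= len(sizes):
        return list(sizes)  # coalesce never increases the partition count
    merged = [0] * n
    for i, size in enumerate(sizes):
        merged[i * n // len(sizes)] += size
    return merged


def repartition_sizes(sizes: List[int], n: int) -> List[int]:
    """Toy model of repartition(n): a full shuffle spreads rows evenly."""
    base, extra = divmod(sum(sizes), n)
    return [base + (1 if i < extra else 0) for i in range(n)]


# Hypothetical row counts: one partition holds 90% of 10,000 rows.
skewed = [9000, 200, 200, 200, 100, 100, 100, 100]

print(coalesce_sizes(skewed, 2))     # [9600, 400]
print(repartition_sizes(skewed, 2))  # [5000, 5000]
```

In this model the hot partition is simply absorbed into a bigger one, so the imbalance gets slightly worse rather than better.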

Shouldn’t AQE (Adaptive Query Execution) have caught this?

Unlike repartition(n), coalesce(n) does not trigger a full shuffle. AQE re-optimizes a query at shuffle boundaries, using the runtime statistics gathered when a shuffle stage completes. Because coalesce(n) adds no shuffle, it creates no new stage boundary and produces no fresh statistics, so the precondition for AQE to step in is never met and the skewed partition goes unnoticed.

Conclusion

AQE didn’t help — not because it failed, but because we never gave it the chance.

Re: Understanding Coalesce, Skewed Joins, and Why AQE Doesn't Always Intervene

Louis_Frolio — Thu, 17 Apr 2025 23:18:40 GMT

@mark_ott , this question seems right up your alley. Care to comment?
