
Understanding Coalesce, Skewed Joins, and Why AQE Doesn't Always Intervene

techgeorge
New Contributor II

In Spark, data skew can be the silent killer of performance. When one oversized partition pulls in 90% of the data, a single task does most of the work while the rest of the cluster waits.

But even with AQE (Adaptive Query Execution) turned on in Databricks, skew isn't always detected automatically, and here’s why.

What Is coalesce() in Spark?

The coalesce(n) function reduces the number of partitions in a DataFrame without a full shuffle. It is typically used to compact data after a wide transformation like a join or groupBy, and it’s especially useful when:

  • You're writing output to disk (e.g., Parquet, Delta) and want fewer files.

  • You're post-processing the output of a wide transformation and want to consolidate many small partitions.

But because coalesce merges existing partitions rather than redistributing rows, a disproportionately large volume of data can remain concentrated in a single partition. The result is severe data skew, where one task handles the majority of the workload while the others remain underutilized.
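To make the effect concrete, here is a minimal sketch in plain Python (not Spark itself) that models what happens to partition sizes. It assumes coalesce(n) merges contiguous groups of existing partitions and that repartition(n) spreads rows evenly through a full shuffle. Both are simplifications: Spark's real coalescer also considers data locality, and the row counts and function names below are hypothetical.

```python
from typing import List


def coalesce_sizes(sizes: List[int], n: int) -> List[int]:
    """Model coalesce(n): merge contiguous groups of existing partitions.

    Rows never move between groups, so a large input partition stays large.
    Simplification: real Spark also weighs data locality when grouping.
    """
    total = len(sizes)
    if n >= total:
        return list(sizes)  # coalesce never increases the partition count
    return [sum(sizes[i * total // n:(i + 1) * total // n]) for i in range(n)]


def repartition_sizes(sizes: List[int], n: int) -> List[int]:
    """Model repartition(n): a full shuffle spreads rows roughly evenly."""
    base, extra = divmod(sum(sizes), n)
    return [base + (1 if i < extra else 0) for i in range(n)]


# Hypothetical row counts per partition after a join with one hot key.
after_join = [900, 10, 20, 15, 5, 20, 10, 20]

coalesced = coalesce_sizes(after_join, 2)
repartitioned = repartition_sizes(after_join, 2)

print(coalesced)      # [945, 55]   the hot partition dominates one task
print(repartitioned)  # [500, 500]  the shuffle evens out the load
```

On a real DataFrame, the per-partition row counts can be checked by grouping on pyspark.sql.functions.spark_partition_id() before and after calling coalesce(n).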

Data Skew.png

 

Shouldn’t AQE (Adaptive Query Execution) have caught this?

The coalesce(n) operation does not trigger a full shuffle the way repartition(n) does. AQE re-optimizes a query at runtime using the statistics gathered when a shuffle stage completes, so a shuffle is the precondition for invoking it. Because coalesce introduces no shuffle, there are no runtime statistics for the optimizer to inspect and no point at which AQE could step in and detect the skew.
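The dependence on shuffle statistics can be illustrated with a small plain-Python model of the skew check described in the open-source Spark documentation for spark.sql.adaptive.skewJoin.skewedPartitionFactor and spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes: a shuffle partition counts as skewed when it is larger than the factor times the median partition size and also larger than the byte threshold. This is a sketch, not Spark's code. The function name and sizes are hypothetical, the defaults used (5 and 256MB) are assumed from the open-source documentation and may differ on your runtime, and in Spark this check is applied to join shuffle partitions.

```python
from statistics import median
from typing import List, Optional

MB = 1024 * 1024


def skewed_partitions(shuffle_sizes: Optional[List[int]],
                      factor: float = 5.0,
                      threshold_bytes: int = 256 * MB) -> List[int]:
    """Return indices of partitions the modeled rule would flag as skewed.

    shuffle_sizes is None when the stage produced no shuffle statistics,
    which is the situation after coalesce(n).
    """
    if not shuffle_sizes:
        return []  # nothing to inspect, so nothing can be flagged
    med = median(shuffle_sizes)
    return [i for i, size in enumerate(shuffle_sizes)
            if size > factor * med and size > threshold_bytes]


# Hypothetical per-partition sizes (bytes) reported by a completed shuffle.
after_shuffle = [4096 * MB, 64 * MB, 80 * MB, 72 * MB, 60 * MB]
print(skewed_partitions(after_shuffle))  # [0]

# After coalesce(n) there is no shuffle, hence no statistics to evaluate.
print(skewed_partitions(None))  # []
```

The second call is the case this post describes: the oversized partition exists, but the check never runs because no shuffle produced sizes for it to compare.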

Conclusion

AQE didn’t help — not because it failed, but because we never gave it the chance.

 

@techgeorge
1 REPLY

Louis_Frolio
Databricks Employee

@mark_ott, this question seems right up your alley. Care to comment?