Databricks Community

techgeorge · ‎04-15-2025

In Spark, data skew can be the silent killer of performance. One wide partition pulling in 90% of the data? That single task ends up setting the runtime for the whole stage.

But even with AQE (Adaptive Query Execution) turned on in Databricks, skew isn't always detected automatically, and here’s why.

What Is coalesce() in Spark?

The coalesce(n) function reduces the number of partitions in a DataFrame without a full shuffle by merging existing partitions. It is typically used to compact data after a wide transformation like a join or groupBy. It’s especially useful when:

You're writing output to disk (e.g., Parquet, Delta) and want fewer files.
You're post-processing skewed data and want to redistribute load more evenly.

But because coalesce(n) only merges existing partitions and never splits them, it can leave a disproportionately large volume of data concentrated in a single partition. The result is severe data skew: one task handles the majority of the workload while the others remain underutilized.
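To make the effect concrete, here is a toy model in plain Python (no Spark required). It is only a sketch: the function names and row counts are hypothetical, the contiguous grouping is a simplification of how Spark picks partitions to merge, and the perfectly even round-robin split is an idealized stand-in for repartition(n).

```python
from typing import List


def coalesce_sizes(sizes: List[int], n: int) -> List[int]:
    """Toy model of coalesce(n): merge whole, adjacent partitions into n groups.

    Input partitions are never split, so an oversized partition stays
    oversized. The contiguous grouping is a simplifying assumption.
    """
    if n >= len(sizes):
        return list(sizes)  # coalesce never increases the partition count
    total = len(sizes)
    return [sum(sizes[i * total // n:(i + 1) * total // n]) for i in range(n)]


def repartition_sizes(sizes: List[int], n: int) -> List[int]:
    """Toy model of repartition(n): a full shuffle deals rows out evenly."""
    base, extra = divmod(sum(sizes), n)
    return [base + (1 if i < extra else 0) for i in range(n)]


def skew_ratio(sizes: List[int]) -> float:
    """Largest partition divided by the mean partition size (1.0 = balanced)."""
    return max(sizes) / (sum(sizes) / len(sizes))


# Hypothetical row counts after a wide transformation: one hot partition
# holds 90% of the rows.
after_join = [900_000, 10_000, 20_000, 15_000, 25_000, 10_000, 10_000, 10_000]

coalesced = coalesce_sizes(after_join, 4)
repartitioned = repartition_sizes(after_join, 4)

print(coalesced, round(skew_ratio(coalesced), 2))
# [910000, 35000, 35000, 20000] 3.64
print(repartitioned, round(skew_ratio(repartitioned), 2))
# [250000, 250000, 250000, 250000] 1.0
```

In this model the hot partition survives coalescing almost untouched, while the shuffle-based path ends up balanced. The price of that balance is the full shuffle itself.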

Shouldn’t AQE (Adaptive Query Execution) have caught this?

The coalesce(n) operation does not trigger a full shuffle the way repartition(n) does. AQE re-optimizes a query at shuffle boundaries, using the runtime statistics collected there, so a shuffle is a precondition for invoking it. With no shuffle to detect, nothing signals the optimizer at run time that AQE could be applied, and the skew left behind by coalesce(n) goes unnoticed.
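One way to keep that trade-off visible in code is a small helper that makes the choice between the two calls explicit. This is a minimal sketch, not a recommended pattern: the helper name compact and its rebalance flag are hypothetical, and it relies only on the coalesce and repartition methods of a Spark DataFrame.

```python
def compact(df, num_partitions, rebalance=False):
    """Reduce the partition count of a Spark DataFrame.

    rebalance=False -> df.coalesce(n): no full shuffle, so existing skew
                       is preserved and there is no shuffle boundary.
    rebalance=True  -> df.repartition(n): full shuffle, rows are
                       redistributed across the n output partitions.
    """
    if rebalance:
        return df.repartition(num_partitions)
    return df.coalesce(num_partitions)
```

With a real DataFrame, compact(df, 8, rebalance=True).explain() should show an Exchange node in the physical plan, while the coalesce path should show a Coalesce node and no Exchange. The shuffle costs extra I/O and network traffic, so it is worth paying only when the skew is actually hurting.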