Spark 3 AQE and cache

pantelis_mare — Wed, 27 Apr 2022 07:41:08 GMT

Hello everybody,

I recently discovered (the hard way) that when a query plan uses cached data, AQE does not kick in. The result is that you lose the super cool feature of dynamic partition coalescing (no more custom shuffle readers in the DAG).

Is there a way to combine both? If not, do you guys know what the rule is or have any links I could read?

My understanding after testing is that if a cached dataframe is present in the SQL query, then there is no adaptive query plan for the whole query. Is that correct?

Cheers,

Pantelis

Re: Spark 3 AQE and cache

User16763506477 — Tue, 09 Aug 2022 02:41:47 GMT

Hi @Pantelis Maroudis

Do you have a sample query to test this? AQE kicked in when I tried a simple aggregation query (i.e. a group by) on a cached table.

Re: Spark 3 AQE and cache

jose_gonzalez — Mon, 15 Aug 2022 20:45:53 GMT

Hi @Pantelis Maroudis,

Did you check the physical query plan? Did you check the SQL tab in the Spark UI? It will help you understand better what is happening.
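For anyone following along, checking the physical plan can also be done from code. The snippet below is only a minimal sketch: the sample aggregation, the app name, and the `usesAqe` value are illustrative assumptions, not something taken from the original query. It relies on the fact that an adaptive plan shows an AdaptiveSparkPlan node in the plan text.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the existing session in a notebook; the local master is an assumption for standalone runs
val spark = SparkSession.builder().master("local[*]").appName("aqe-plan-check").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.adaptive.enabled", "true")

// Illustrative query with a shuffle (group by), so that AQE has something to act on
val df = Seq(1, 2, 3, 4).toDF("value").groupBy("value").count()

// Prints the physical plan to stdout
df.explain("formatted")

// Same check in code: true when an AdaptiveSparkPlan node appears in the plan text
val usesAqe = df.queryExecution.executedPlan.toString.contains("AdaptiveSparkPlan")
println(s"Plan contains AdaptiveSparkPlan: $usesAqe")
```

The SQL tab in the Spark UI shows the same plan graphically, including the final plan after the query has run.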

Re: Spark 3 AQE and cache

pantelis_mare — Tue, 13 Sep 2022 16:28:00 GMT

Hello @Gaurav Rupnar

The following code snippet reproduces the behaviour I described.

See how the query plan changes when you comment out the cache() call on the res dataframe:

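import spark.implicits._ // needed for toDF outside a notebook

// Many shuffle partitions, so AQE partition coalescing is easy to spot in the plan
spark.conf.set("spark.sql.shuffle.partitions", 2000)
// Disable broadcast joins to force a shuffle join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val factData = Seq(1,2,3,4,5,6,7,8,9,10).toDF("value")
val dimData = Seq(1,2,3).toDF("value")

val res = factData.join(dimData, Seq("value"))

res.cache() // comment out this line to compare the query plans
res.write.format("noop").mode("append").save()
res.unpersist()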
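
To compare the two variants without editing the snippet between runs, something like the following could print both plans side by side. This is a sketch under assumptions: `buildJoin` is a hypothetical helper that rebuilds the same join, and the plans printed here are the initial physical plans, before any action has run.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Reuses the existing session in a notebook; the local master is an assumption for standalone runs
val spark = SparkSession.builder().master("local[*]").appName("aqe-cache-compare").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Hypothetical helper: rebuilds the same join so each variant gets its own plan
def buildJoin(): DataFrame = {
  val factData = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).toDF("value")
  val dimData = Seq(1, 2, 3).toDF("value")
  factData.join(dimData, Seq("value"))
}

// Plan for the plain join, computed before anything is cached
val plain = buildJoin()
println("=== without cache ===")
println(plain.queryExecution.executedPlan)

// Plan for the same join once it is marked as cached
val cached = buildJoin().cache()
println("=== with cache ===")
println(cached.queryExecution.executedPlan)
cached.unpersist()
```

Look for AdaptiveSparkPlan and the shuffle reader nodes in each output; the final plan after execution is easier to read in the SQL tab of the Spark UI.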