Spark 3 AQE and cache

pantelis_mare
Contributor III

Hello everybody,

I recently discovered (the hard way) that when a query plan uses cached data, AQE does not kick in. The result is that you lose the super cool feature of dynamic partition coalescing (no more custom shuffle readers in the DAG).

Is there a way to combine both? If not, do you guys know what the rule is or have any links I could read?

My understanding after testing is that if a cached DataFrame is present in the SQL query, then there is no adaptive query plan for the whole query. Is that correct?

Cheers,

Pantelis

User16763506477
Databricks Employee

Hi @Pantelis Maroudis,

Do you have a sample query to test this? AQE kicked in when I tried a simple aggregation query (i.e. a group by) on a cached table.

Hello @Gaurav Rupnar,

The following code snippet reproduces what I described.

See how the query plan changes when you comment out the cache() call on the res DataFrame:

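// Needed for toDF outside a notebook; harmless if already imported
import spark.implicits._

// Many shuffle partitions, so that AQE partition coalescing is easy to spot
spark.conf.set("spark.sql.shuffle.partitions", 2000)
// Disable broadcast joins to force a shuffle (sort-merge) join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val factData = Seq(1,2,3,4,5,6,7,8,9,10).toDF("value")
val dimData = Seq(1,2,3).toDF("value")

val res = factData.join(dimData, Seq("value"))

// Comment out the next line to compare the plan with and without the cache
res.cache()
// The noop sink runs the full query without writing any output
res.write.format("noop").mode("append").save()
res.unpersist()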
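
On combining the two: as far as I know, Spark 3.2 added a setting, spark.sql.optimizer.canChangeCachedPlanOutputPartitioning, that lets AQE be applied to the plan that builds the cache. I believe it was off by default when it was introduced, so that the output partitioning of cached data stays stable. The sketch below assumes Spark 3.2 or later and a `spark` session as in the snippet above; whether it is available depends on your runtime version, so treat it as something to try rather than a confirmed fix.

```scala
// Assumption: Spark 3.2+ (the second setting does not exist in 3.0/3.1).
// Both settings should be in place before cache() is called on the DataFrame.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")
```

After setting these, rerun the snippet above with the cache() call in place and compare the plans again.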

jose_gonzalez
Databricks Employee

Hi @Pantelis Maroudis,

Did you check the physical query plan? Did you check the SQL tab in the Spark UI? It will help you better understand what is happening.
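
Besides the UI, the physical plan can also be checked from code. The sketch below is a small, hypothetical helper (PlanCheck is not part of Spark) that looks for the node names AQE leaves in the plan text. It assumes that the adaptive root node prints as AdaptiveSparkPlan and that the coalescing reader prints as CustomShuffleReader (Spark 3.0/3.1) or AQEShuffleRead (Spark 3.2+).

```scala
// Hypothetical helper, not a Spark API: it only scans the plan's text form.
object PlanCheck {
  // Name printed for the AQE root node in the physical plan
  private val adaptiveMarker = "AdaptiveSparkPlan"
  // Names printed for the AQE shuffle reader (3.0/3.1 and 3.2+ respectively)
  private val shuffleReadMarkers = Seq("CustomShuffleReader", "AQEShuffleRead")

  // True if the plan text contains an adaptive plan node
  def usesAqe(planText: String): Boolean =
    planText.contains(adaptiveMarker)

  // True if the plan text contains an AQE shuffle reader node
  def hasAqeShuffleRead(planText: String): Boolean =
    shuffleReadMarkers.exists(marker => planText.contains(marker))
}
```

In a notebook you could pass it res.queryExecution.executedPlan.toString, once with the cache() call and once without. Note that the coalescing reader only appears once the plan has been executed, so run the write first and check the final plan.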