cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Spark 3 AQE and cache

pantelis_mare
Contributor III

Hello everybody,

I recently discovered (the hard way) that when a query plan uses cached data, the AQE does not kick-in. Result is that you loose the super cool feature of dynamic partition coalesce (no more custom shuffle readers in the DAG).

Is there a way to combine both? If not, do you guys know what the rule is or have any links I could read?

My understanding after testing is that if a cached dataframe in present in the sql query, then you have no adaptive query plan on the whole query. Is that correct?

Cheers,

Pantelis

3 REPLIES 3

User16763506477
Contributor III

Hi @Pantelis Maroudis​ 

Do you have a sample query to test this? AQE was kicked in when I tried with a simple aggregation query (i.e group by) on a cached table.

Hello @Gaurav Rupnar​ 

The following code snippet reproduces my statement.

See how the query plan changes when you comment the cache() on the res dataframe

spark.conf.set("spark.sql.shuffle.partitions", 2000)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
 
val factData = Seq(1,2,3,4,5,6,7,8,9,10).toDF("value")
val dimData = Seq(1,2,3).toDF("value")
 
val res = factData.join(dimData, Seq("value"))
 
res.cache()
res.write.format("noop").mode("append").save()
res.unpersist()

jose_gonzalez
Moderator
Moderator

Hi @Pantelis Maroudis​,

Did you check the physical query plan? did you check the SQL sub tab with in Spark UI? it will help you to undertand better what is happening.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.