Spark 3 AQE and cache
04-27-2022 12:41 AM
Hello everybody,
I recently discovered (the hard way) that when a query plan uses cached data, AQE does not kick in. As a result you lose the super cool dynamic partition coalescing feature (no more custom shuffle readers in the DAG).
Is there a way to combine both? If not, does anyone know what the rule is, or have any links I could read?
My understanding after testing is that if a cached dataframe is present in the SQL query, then you get no adaptive query plan for the whole query. Is that correct?
Cheers,
Pantelis
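For what it's worth, newer Spark versions appear to expose a flag that is meant to let AQE optimizations apply on top of cached plans (SPARK-35332, Spark 3.2+, default off, if I read it correctly). A minimal sketch, assuming the flag is available on your runtime; please verify against your Spark version's configuration reference before relying on it:

```scala
// Hedged sketch: assumes Spark 3.2+ where this flag exists (default false).
// When enabled, AQE is allowed to change the output partitioning of cached
// plans, so partition coalescing can apply even when a dataframe is cached.
spark.conf.set("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")

// AQE itself must also be on (it is by default from Spark 3.2):
spark.conf.set("spark.sql.adaptive.enabled", "true")
```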
08-08-2022 07:41 PM
Hi @Pantelis Maroudis
Do you have a sample query to test this? AQE kicked in when I tried a simple aggregation query (i.e. a group by) on a cached table.
09-13-2022 09:28 AM
Hello @Gaurav Rupnar
The following code snippet reproduces my statement.
See how the query plan changes when you comment out the cache() on the res dataframe:
import spark.implicits._  // needed for toDF when running outside a notebook

spark.conf.set("spark.sql.shuffle.partitions", 2000)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)  // force a sort-merge join instead of a broadcast join
val factData = Seq(1,2,3,4,5,6,7,8,9,10).toDF("value")
val dimData = Seq(1,2,3).toDF("value")
val res = factData.join(dimData, Seq("value"))
res.cache()  // comment this line out and AQE's coalesced shuffle reader reappears in the plan
res.write.format("noop").mode("append").save()
res.unpersist()
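To see the difference directly, you can print the formatted physical plan before and after caching; a sketch, assuming AQE is enabled on your cluster:

```scala
// Hedged sketch: inspect the plan for the repro above.
// An AdaptiveSparkPlan node indicates AQE applies to that subtree;
// an InMemoryTableScan / InMemoryRelation node marks the cached portion.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

res.explain("formatted")  // compare the output with and without res.cache()
```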
08-15-2022 01:45 PM
Hi @Pantelis Maroudis,
Did you check the physical query plan? Did you check the SQL sub-tab within the Spark UI? It will help you understand better what is happening.