Databricks Community

pantelis_mare · ‎04-27-2022

Hello everybody,

I recently discovered (the hard way) that when a query plan uses cached data, AQE does not kick in. The result is that you lose the super cool feature of dynamic partition coalescing (no more custom shuffle readers in the DAG).

Is there a way to combine both? If not, do you guys know what the rule is or have any links I could read?

My understanding after testing is that if a cached dataframe is present in the SQL query, then there is no adaptive query plan for the whole query. Is that correct?

Cheers,

Pantelis

User16763506477 · ‎08-08-2022

Hi @Pantelis Maroudis

Do you have a sample query to test this? AQE kicked in when I tried a simple aggregation query (i.e., a group by) on a cached table.

pantelis_mare · ‎09-13-2022

Hello @Gaurav Rupnar

The following code snippet reproduces the behavior I described.

See how the query plan changes when you comment out the cache() call on the res dataframe:

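// Use many shuffle partitions so that AQE partition coalescing is easy to spot.
spark.conf.set("spark.sql.shuffle.partitions", 2000)
// Disable broadcast joins so the join has to shuffle.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val factData = Seq(1,2,3,4,5,6,7,8,9,10).toDF("value")
val dimData = Seq(1,2,3).toDF("value")

val res = factData.join(dimData, Seq("value"))

// Comment out cache() and unpersist() to compare the two query plans.
res.cache()
// The noop sink runs the full query without writing any output.
res.write.format("noop").mode("append").save()
res.unpersist()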

jose_gonzalez · ‎08-15-2022

Hi @Pantelis Maroudis,

Did you check the physical query plan? Did you check the SQL sub-tab within the Spark UI? It will help you understand better what is happening.
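
For anyone following along, here is a minimal sketch of that check done in code rather than in the UI. It assumes a notebook or spark-shell where `spark` is already defined. The helper `planMentionsAqe` and the function `buildRes` are hypothetical names introduced only for this illustration; the helper just looks for the `AdaptiveSparkPlan` node name in the plan text. The commented-out setting `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning` (present from Spark 3.2) is, as far as I understand, meant to let AQE apply when the cached plan is built, but whether it helps on a given runtime is an assumption to verify, not something confirmed in this thread.

```scala
import org.apache.spark.sql.DataFrame
import spark.implicits._

// Hypothetical helper: true if the plan text mentions an AdaptiveSparkPlan node.
def planMentionsAqe(planText: String): Boolean =
  planText.contains("AdaptiveSparkPlan")

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", 2000)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
// Assumption: on Spark 3.2+ this may allow AQE inside the cached plan.
// spark.conf.set("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")

val factData = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).toDF("value")
val dimData = Seq(1, 2, 3).toDF("value")

// Build a fresh DataFrame each time so the plan is re-derived after caching.
def buildRes(): DataFrame = factData.join(dimData, Seq("value"))

// Plan before anything is cached.
val before = buildRes().queryExecution.executedPlan.toString
println(s"Mentions AQE before cache: ${planMentionsAqe(before)}")

// Cache and materialize, then plan the same query again.
val cached = buildRes().cache()
cached.write.format("noop").mode("append").save()
val after = buildRes().queryExecution.executedPlan.toString
println(s"Mentions AQE after cache: ${planMentionsAqe(after)}")
println(after)

cached.unpersist()
```

The printed values depend on the Spark version and runtime, so compare them with what the SQL tab shows for the same query instead of relying on the string check alone.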

Spark 3 AQE and cache