Spark 3 AQE and cache

pantelis_mare
Contributor III

Hello everybody,

I recently discovered (the hard way) that when a query plan uses cached data, AQE does not kick in. The result is that you lose the super cool feature of dynamic partition coalescing (no more custom shuffle readers in the DAG).

Is there a way to combine both? If not, do you guys know what the rule is or have any links I could read?

My understanding after testing is that if a cached DataFrame is present in the SQL query, then there is no adaptive query plan for the whole query. Is that correct?

Cheers,

Pantelis

3 REPLIES

User16763506477
Contributor III

Hi @Pantelis Maroudis

Do you have a sample query to test this? AQE kicked in when I tried a simple aggregation query (i.e. a group by) on a cached table.
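
A test along those lines might look like the sketch below. The data, the `bucket` column and the local session are assumptions made for illustration, not the query that was actually run. It caches a small DataFrame, aggregates it, and checks whether the root of the executed plan is an `AdaptiveSparkPlanExec` node.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec

// In a notebook `spark` already exists; the local session is only for a standalone run.
val spark = SparkSession.builder().master("local[2]").appName("aqe-cached-agg").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.adaptive.enabled", "true")

// Hypothetical data: cache it and materialize the cache with an action.
val cached = (1 to 1000).toDF("value").cache()
cached.count()

// A group by on top of the cached data adds a shuffle above the cache scan.
val agg = cached.groupBy(($"value" % 10).as("bucket")).count()

// True when the query on top of the cache is planned adaptively.
val rootIsAdaptive = agg.queryExecution.executedPlan.isInstanceOf[AdaptiveSparkPlanExec]
println(s"Adaptive plan at the root: $rootIsAdaptive")
agg.explain()

cached.unpersist()
```

Note that this only looks at the plan built on top of the cached data, not at how the cached query itself was planned.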

Hello @Gaurav Rupnar

The following code snippet reproduces my statement.

See how the query plan changes when you comment out the cache() call on the res DataFrame.

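import spark.implicits._ // needed for toDF outside a notebook or spark-shell

// Many shuffle partitions, so AQE partition coalescing is easy to spot in the plan
spark.conf.set("spark.sql.shuffle.partitions", 2000)
// Disable broadcast joins to force a shuffle join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val factData = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).toDF("value")
val dimData = Seq(1, 2, 3).toDF("value")

val res = factData.join(dimData, Seq("value"))

res.cache() // comment this line out to compare the query plans
res.write.format("noop").mode("append").save() // noop sink: runs the query without writing output
res.unpersist()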
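
One setting that may be worth testing against this snippet, assuming Spark 3.2 or later, is `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning`. When it is `true`, Spark is allowed to plan the cached query itself with AQE, at the cost of the cached data's output partitioning no longer being fixed. Whether the key is honored, and what its default is, depends on the Spark or Databricks Runtime version, so treat this as a sketch to verify rather than a confirmed fix. As far as I can tell it has to be set before `cache()` is called.

```scala
import org.apache.spark.sql.SparkSession

// In a notebook `spark` already exists; the local session is only for a standalone run.
val spark = SparkSession.builder().master("local[2]").appName("aqe-cache-config").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", 2000)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
// Assumption: Spark 3.2+. Older versions accept the key but do nothing with it.
spark.conf.set("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")

val factData = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).toDF("value")
val dimData = Seq(1, 2, 3).toDF("value")
val res = factData.join(dimData, Seq("value"))

res.cache() // cached after the config is set
res.write.format("noop").mode("append").save()
res.explain() // look for AdaptiveSparkPlan under the InMemoryRelation node
res.unpersist()
```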

jose_gonzalez
Databricks Employee

Hi @Pantelis Maroudis,

Did you check the physical query plan? Did you check the SQL sub-tab within the Spark UI? It will help you understand better what is happening.
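
The same check can also be done in code rather than in the Spark UI. As a rough sketch, `explain("formatted")` prints the physical plan, and a small helper (the name `mentionsAqe` is made up for this example) looks for an `AdaptiveSparkPlan` node in the plan's string form. It assumes Spark 3.0 or later and reuses the data from the snippet earlier in the thread.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// In a notebook `spark` already exists; the local session is only for a standalone run.
val spark = SparkSession.builder().master("local[2]").appName("aqe-plan-check").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

// Hypothetical helper: true if the printed physical plan contains an AdaptiveSparkPlan node.
def mentionsAqe(df: DataFrame): Boolean =
  df.queryExecution.executedPlan.treeString.contains("AdaptiveSparkPlan")

val factData = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).toDF("value")
val dimData = Seq(1, 2, 3).toDF("value")

// Plan without caching.
val plain = factData.join(dimData, Seq("value"))
plain.explain("formatted")
println(s"AQE in uncached plan: ${mentionsAqe(plain)}")

// Same join, cached and materialized; its plan now reads from the cache.
val cached = factData.join(dimData, Seq("value")).cache()
cached.count()
cached.explain("formatted")
println(s"AQE in cached plan: ${mentionsAqe(cached)}")

cached.unpersist()
```

Comparing the two printed plans side by side shows where the shuffle reader nodes appear or disappear.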
