Databricks Community

Direo · ‎09-17-2023

Hi!

We have recently upgraded our cluster from Databricks Runtime 10.4 LTS which includes Apache Spark 3.2.1 to to Databricks Runtime 13.3 LTSincludes Apache Spark 3.2.1 powered by Apache Spark 3.3.0 and noticed that one of our jobs runtime has dramatically increased (actually it was terminated before it even finished).

On 3.2.1 the job would run aprox. 40 mins, while on 3.3.0 it was terminated after 3 hours. According to driver logs, it stuck on one stage which had 160000 tasks (TaskSetManager: Finished task 18443.0 in stage 204.0 in 8746 ms on (executor 1) (17678/160000)) while the run on 3.2.1 would never reach that number - max. a few hundred tasks. Also worth mentioning that disk was expanding to the hights I never seen.

I was able to find the function where behaviour of 3.3.0 has changed compared to 3.2.1. It is crossjoin.

Has anyone experienced such situation or maybe have suggestion why it could have happened? I have a feeling that it has something to do with new or changed default spark properties, but could not identify which ones.

Also, spark.conf.set("spark.databricks.io.cache.enabled", "true").