<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Unexpected performance behaviors due to changes in the Spark engine or Databricks runtime in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/unexpected-performance-behaviors-due-to-changes-in-the-spark/m-p/45174#M27803</link>
    <description>&lt;P&gt;Hi!&lt;BR /&gt;&lt;BR /&gt;We recently upgraded our cluster from&amp;nbsp;&lt;SPAN&gt;Databricks Runtime 10.4 LTS, which includes Apache Spark 3.2.1, to Databricks Runtime 13.3 LTS, powered by Apache Spark 3.3.0, and noticed that the runtime of one of our jobs has increased dramatically (in fact, the job was terminated before it finished).&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;On 3.2.1 the job ran for approx. 40 minutes, while on 3.3.0 it was terminated after 3 hours. According to the driver logs, it got stuck on one stage that had 160000 tasks (TaskSetManager: Finished task 18443.0 in stage 204.0 in 8746 ms on (executor 1) (17678/160000)), while the run on 3.2.1 never reached that number: a few hundred tasks at most. It is also worth mentioning that the disk was expanding to heights I had never seen.&lt;BR /&gt;&lt;BR /&gt;I was able to find the function whose behaviour changed in 3.3.0 compared to 3.2.1. It is &lt;STRONG&gt;crossjoin&lt;/STRONG&gt;.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Has anyone experienced a similar situation, or does anyone have a suggestion as to why this could have happened? I have a feeling that it has something to do with new or changed default Spark properties, but I could not identify which ones.&lt;BR /&gt;&lt;BR /&gt;Also,&amp;nbsp;spark.conf.set("spark.databricks.io.cache.enabled", "true").&lt;/P&gt;</description>
    <pubDate>Sun, 17 Sep 2023 20:37:28 GMT</pubDate>
    <dc:creator>Direo</dc:creator>
    <dc:date>2023-09-17T20:37:28Z</dc:date>
    <item>
      <title>Unexpected performance behaviors due to changes in the Spark engine or Databricks runtime</title>
      <link>https://community.databricks.com/t5/data-engineering/unexpected-performance-behaviors-due-to-changes-in-the-spark/m-p/45174#M27803</link>
      <description>&lt;P&gt;Hi!&lt;BR /&gt;&lt;BR /&gt;We recently upgraded our cluster from&amp;nbsp;&lt;SPAN&gt;Databricks Runtime 10.4 LTS, which includes Apache Spark 3.2.1, to Databricks Runtime 13.3 LTS, powered by Apache Spark 3.3.0, and noticed that the runtime of one of our jobs has increased dramatically (in fact, the job was terminated before it finished).&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;On 3.2.1 the job ran for approx. 40 minutes, while on 3.3.0 it was terminated after 3 hours. According to the driver logs, it got stuck on one stage that had 160000 tasks (TaskSetManager: Finished task 18443.0 in stage 204.0 in 8746 ms on (executor 1) (17678/160000)), while the run on 3.2.1 never reached that number: a few hundred tasks at most. It is also worth mentioning that the disk was expanding to heights I had never seen.&lt;BR /&gt;&lt;BR /&gt;I was able to find the function whose behaviour changed in 3.3.0 compared to 3.2.1. It is &lt;STRONG&gt;crossjoin&lt;/STRONG&gt;.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Has anyone experienced a similar situation, or does anyone have a suggestion as to why this could have happened? I have a feeling that it has something to do with new or changed default Spark properties, but I could not identify which ones.&lt;BR /&gt;&lt;BR /&gt;Also,&amp;nbsp;spark.conf.set("spark.databricks.io.cache.enabled", "true").&lt;/P&gt;</description>
      <pubDate>Sun, 17 Sep 2023 20:37:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unexpected-performance-behaviors-due-to-changes-in-the-spark/m-p/45174#M27803</guid>
      <dc:creator>Direo</dc:creator>
      <dc:date>2023-09-17T20:37:28Z</dc:date>
    </item>
    <item>
      <title>Re: Unexpected performance behaviors due to changes in the Spark engine or Databricks runtime</title>
      <link>https://community.databricks.com/t5/data-engineering/unexpected-performance-behaviors-due-to-changes-in-the-spark/m-p/45218#M27817</link>
      <description>&lt;P&gt;It seems that broadcasting the smaller table in the crossjoin did the trick.&lt;/P&gt;</description>
      <pubDate>Mon, 18 Sep 2023 10:24:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unexpected-performance-behaviors-due-to-changes-in-the-spark/m-p/45218#M27817</guid>
      <dc:creator>Direo</dc:creator>
      <dc:date>2023-09-18T10:24:37Z</dc:date>
    </item>
  </channel>
</rss>

