<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How do you analyze performance in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/how-do-you-analyze-performance/m-p/73897#M3120</link>
    <description>&lt;P&gt;Curious to hear how you all optimize compute. Specifically, how do you dig into the details of Spark execution and improve it?&lt;/P&gt;</description>
    <pubDate>Thu, 13 Jun 2024 20:13:01 GMT</pubDate>
    <dc:creator>Newbienewbster</dc:creator>
    <dc:date>2024-06-13T20:13:01Z</dc:date>
    <item>
      <title>How do you analyze performance</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-do-you-analyze-performance/m-p/73897#M3120</link>
      <description>&lt;P&gt;Curious to hear how you all optimize compute. Specifically, how do you dig into the details of Spark execution and improve it?&lt;/P&gt;</description>
      <pubDate>Thu, 13 Jun 2024 20:13:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-do-you-analyze-performance/m-p/73897#M3120</guid>
      <dc:creator>Newbienewbster</dc:creator>
      <dc:date>2024-06-13T20:13:01Z</dc:date>
    </item>
    <item>
      <title>Re: How do you analyze performance</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-do-you-analyze-performance/m-p/74006#M3121</link>
      <description>&lt;P&gt;That is it. Usually, people use the time it takes to run a job, query, or process as their KPI.&lt;/P&gt;
&lt;P&gt;Then you check which steps take the most time, drilling down one by one. Sometimes the culprit is a misplaced .cache(), .collect() or display() that forces Spark to compute everything. You can do the same for queries with the query profiler: check whether there was a shuffle, how many rows were processed, and whether there was disk spill. You can also check for data skew.&lt;/P&gt;
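&lt;P&gt;A minimal sketch of that drill-down in plain Python, not an official Databricks or Spark API. The helper names timed, slowest_first and skew_ratio are made up for illustration, and the sleep calls stand in for real work. In a real job each timed block would wrap a Spark action (a write or a count, for example), because transformations are lazy and take almost no time on their own.&lt;/P&gt;

```python
import time
from contextlib import contextmanager
from statistics import median

timings = {}  # hypothetical registry: step name -> elapsed seconds


@contextmanager
def timed(step):
    """Record the wall-clock duration of one pipeline step."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = time.perf_counter() - start


def slowest_first(results):
    """Return (step, seconds) pairs, slowest step first."""
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)


def skew_ratio(partition_rows):
    """Largest partition size divided by the median size.

    Assumes a non-empty list with a positive median.
    Values far above 1 suggest skewed data.
    """
    return max(partition_rows) / median(partition_rows)


# Stand-in steps; replace the sleeps with real Spark actions.
with timed("load"):
    time.sleep(0.01)
with timed("join"):
    time.sleep(0.05)

for step, seconds in slowest_first(timings):
    print(f"{step}: {seconds:.3f}s")

# Illustrative row counts per partition, not real measurements.
print(skew_ratio([100, 110, 90, 1000]))
```

&lt;P&gt;Start with the step at the top of the list, then open the Spark UI or the query profile for that step to look at shuffle, spill and per-task row counts.&lt;/P&gt;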
&lt;P&gt;I really like this blog: &lt;A href="https://www.databricks.com/discover/pages/optimize-data-workloads-guide" target="_blank"&gt;https://www.databricks.com/discover/pages/optimize-data-workloads-guide&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Jun 2024 10:39:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-do-you-analyze-performance/m-p/74006#M3121</guid>
      <dc:creator>mhiltner</dc:creator>
      <dc:date>2024-06-14T10:39:08Z</dc:date>
    </item>
  </channel>
</rss>

