<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: What is the correct way to measure the performance of a Databrick notebook? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/what-is-the-correct-way-to-measure-the-performance-of-a/m-p/98967#M39875</link>
    <description>&lt;P&gt;&lt;SPAN&gt;The behavior you're observing is likely due to a combination of factors related to how Python executes code and how time is measured. Let's break down the issues and provide some recommendations for more accurate timing:&lt;/SPAN&gt;&lt;/P&gt;
&lt;OL class="marker:text-textOff list-decimal pl-8"&gt;
&lt;LI&gt;&lt;SPAN&gt;Resolution of time.time():&lt;BR /&gt;The resolution of time.time() depends on the platform. It is often around 1 microsecond, but it can be considerably coarser on some systems. For very fast operations, this might not be precise enough.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Overhead of function calls:&lt;BR /&gt;Calling time.time() itself takes some time, which can be significant for very fast operations.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Python interpreter overhead:&lt;BR /&gt;The Python interpreter introduces some overhead, especially when executing small pieces of code repeatedly.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;System-level scheduling:&lt;BR /&gt;The operating system may introduce delays between iterations of your loop, leading to inconsistent measurements.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;JIT compilation (if using PyPy):&lt;BR /&gt;If you're using PyPy, just-in-time compilation can cause timing variations.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;For this you can try:&lt;/P&gt;
&lt;OL class="marker:text-textOff list-decimal pl-8"&gt;
&lt;LI&gt;&lt;SPAN&gt;Use timeit module:&lt;/SPAN&gt;&lt;SPAN&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN&gt;The timeit module is designed for benchmarking small code snippets and handles many of the issues mentioned above.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Use time.perf_counter():&lt;/SPAN&gt;&lt;SPAN&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN&gt;For more precise timing, use time.perf_counter() instead of time.time(). It provides higher resolution and is monotonic.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;</description>
    <pubDate>Fri, 15 Nov 2024 16:09:21 GMT</pubDate>
    <dc:creator>Walter_C</dc:creator>
    <dc:date>2024-11-15T16:09:21Z</dc:date>
    <item>
      <title>What is the correct way to measure the performance of a Databrick notebook?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-correct-way-to-measure-the-performance-of-a/m-p/98860#M39861</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Here is my code for converting one column value of a data frame to a formatted time string:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import time
from datetime import datetime

# df is an existing Spark DataFrame
col_value = df.select(df.columns[0]).first()[0]


start_time = time.time()
col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S") \
           if isinstance(col_value, datetime) \
           else col_value
end_time = time.time()
elapsed_time = end_time - start_time
print(elapsed_time)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The elapsed_time value is 0.00011396408081054688, which makes sense, since the conversion should take very little effort.&lt;/P&gt;&lt;P&gt;However, after I put this code inside a Python loop, things turned strange. Here is the code:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;for col in df.columns:
  col_value = df.select(col).first()[0]

  start_time = time.time()
  col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S") \
              if isinstance(col_value, datetime) \
              else col_value
  end_time = time.time()
  elapsed_time = end_time - start_time
  print(elapsed_time)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;After running this code, I found that elapsed_time had increased to 5 seconds!&lt;/P&gt;&lt;P&gt;Then I removed the time conversion logic and measured again, this time also timing the whole loop:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;loop_start_time = time.time()
for col in col_list:
  start_time = time.time()
  # Nothing was done here
  end_time = time.time()
  elapsed_time = end_time - start_time
  print(col, elapsed_time)

loop_end_time = time.time()
loop_elapsed_time = loop_end_time - loop_start_time
print(f"Loop time cost {loop_elapsed_time} seconds")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It looks like each iteration, even with no logic inside it, reports more than 5 seconds as well, but the whole loop only takes 0.0026 seconds.&lt;/P&gt;&lt;P&gt;Why did this happen? Did I miss something? What is the correct way to measure the cost of each Python statement and function?&lt;/P&gt;</description>
      <pubDate>Fri, 15 Nov 2024 05:11:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-correct-way-to-measure-the-performance-of-a/m-p/98860#M39861</guid>
      <dc:creator>guangyi</dc:creator>
      <dc:date>2024-11-15T05:11:03Z</dc:date>
    </item>
    <item>
      <title>Re: What is the correct way to measure the performance of a Databrick notebook?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-correct-way-to-measure-the-performance-of-a/m-p/98967#M39875</link>
      <description>&lt;P&gt;&lt;SPAN&gt;The behavior you're observing is likely due to a combination of factors related to how Python executes code and how time is measured. Let's break down the issues and provide some recommendations for more accurate timing:&lt;/SPAN&gt;&lt;/P&gt;
&lt;OL class="marker:text-textOff list-decimal pl-8"&gt;
&lt;LI&gt;&lt;SPAN&gt;Resolution of time.time():&lt;BR /&gt;The resolution of time.time() depends on the platform. It is often around 1 microsecond, but it can be considerably coarser on some systems. For very fast operations, this might not be precise enough.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Overhead of function calls:&lt;BR /&gt;Calling time.time() itself takes some time, which can be significant for very fast operations.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Python interpreter overhead:&lt;BR /&gt;The Python interpreter introduces some overhead, especially when executing small pieces of code repeatedly.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;System-level scheduling:&lt;BR /&gt;The operating system may introduce delays between iterations of your loop, leading to inconsistent measurements.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;JIT compilation (if using PyPy):&lt;BR /&gt;If you're using PyPy, just-in-time compilation can cause timing variations.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;/OL&gt;
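&lt;P&gt;To see what the first two points mean on a particular machine, here is a minimal sketch. The helper names describe_clock and back_to_back_gap are made up for this illustration; time.get_clock_info() is part of the standard library, and the values it reports will differ between platforms and Python versions.&lt;/P&gt;

```python
import time


def describe_clock(name):
    """Return (resolution in seconds, monotonic flag) as reported for a named clock."""
    info = time.get_clock_info(name)
    return info.resolution, info.monotonic


def back_to_back_gap(clock, samples=1000):
    """Smallest positive difference between two consecutive readings of `clock`.

    This is a rough lower bound on what the clock can resolve, including the
    cost of calling it. Returns None if no positive difference was observed.
    """
    gaps = []
    for _ in range(samples):
        first = clock()
        second = clock()
        gaps.append(second - first)
    positive = [gap for gap in gaps if gap > 0]
    return min(positive) if positive else None


for name, clock in (("time", time.time), ("perf_counter", time.perf_counter)):
    resolution, monotonic = describe_clock(name)
    print(f"{name}: resolution={resolution} s, monotonic={monotonic}, "
          f"smallest observed gap={back_to_back_gap(clock)} s")
```

&lt;P&gt;If the smallest observed gap is close to the duration of the code being measured, a single start/end pair around that code says very little.&lt;/P&gt;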
&lt;P&gt;For this you can try:&lt;/P&gt;
&lt;OL class="marker:text-textOff list-decimal pl-8"&gt;
&lt;LI&gt;&lt;SPAN&gt;Use timeit module:&lt;/SPAN&gt;&lt;SPAN&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN&gt;The timeit module is designed for benchmarking small code snippets and handles many of the issues mentioned above.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Use time.perf_counter():&lt;/SPAN&gt;&lt;SPAN&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN&gt;For more precise timing, use time.perf_counter() instead of time.time(). It provides higher resolution and is monotonic.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;</description>
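&lt;P&gt;As a rough sketch of the two suggestions above, applied to the conversion from the question: format_if_datetime is a name introduced here to wrap that conversion, and sample is a hypothetical stand-in for a value read from the DataFrame. The repeat and number settings are arbitrary choices, not recommendations.&lt;/P&gt;

```python
import time
import timeit
from datetime import datetime


def format_if_datetime(value):
    """Format datetime values as strings; pass every other value through unchanged."""
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d %H:%M:%S")
    return value


# Hypothetical stand-in for a value read from the DataFrame in the question.
sample = datetime(2024, 11, 15, 5, 11, 3)

# timeit: call the function many times per run, repeat the run several times,
# and keep the fastest run to reduce the influence of scheduling noise.
calls_per_run = 10_000
runs = timeit.repeat(lambda: format_if_datetime(sample), number=calls_per_run, repeat=5)
per_call = min(runs) / calls_per_run
print(f"timeit: about {per_call:.3e} s per call")

# perf_counter: a single one-off measurement with a monotonic clock.
start = time.perf_counter()
result = format_if_datetime(sample)
elapsed = time.perf_counter() - start
print(f"perf_counter: {elapsed:.3e} s for one call, result={result}")
```

&lt;P&gt;Taking the minimum of the repeated runs is a common convention, because slower runs usually reflect interference from other processes rather than the code itself.&lt;/P&gt;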
      <pubDate>Fri, 15 Nov 2024 16:09:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-correct-way-to-measure-the-performance-of-a/m-p/98967#M39875</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2024-11-15T16:09:21Z</dc:date>
    </item>
    <item>
      <title>Re: What is the correct way to measure the performance of a Databrick notebook?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-correct-way-to-measure-the-performance-of-a/m-p/99136#M39913</link>
      <description>&lt;P&gt;You asked for the correct way to measure performance. Whilst the above is correct, it won't necessarily help you with production ETL. When jobs start running for hours, you need to use the Spark UI and look at cluster metrics to understand how your work is being distributed across the nodes.&lt;/P&gt;</description>
      <pubDate>Mon, 18 Nov 2024 11:16:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-correct-way-to-measure-the-performance-of-a/m-p/99136#M39913</guid>
      <dc:creator>holly</dc:creator>
      <dc:date>2024-11-18T11:16:53Z</dc:date>
    </item>
    <item>
      <title>Re: What is the correct way to measure the performance of a Databrick notebook?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-correct-way-to-measure-the-performance-of-a/m-p/99216#M39927</link>
      <description>&lt;P&gt;How many columns do you have?&lt;/P&gt;</description>
      <pubDate>Mon, 18 Nov 2024 16:44:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-correct-way-to-measure-the-performance-of-a/m-p/99216#M39927</guid>
      <dc:creator>Lakshay</dc:creator>
      <dc:date>2024-11-18T16:44:50Z</dc:date>
    </item>
    <item>
      <title>Re: What is the correct way to measure the performance of a Databrick notebook?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-correct-way-to-measure-the-performance-of-a/m-p/99298#M39953</link>
      <description>&lt;P&gt;Thank you for your advice. I found some learning material about how to do performance profiling in Python. cProfile gave me correct results.&lt;/P&gt;</description>
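&lt;P&gt;The profiling code itself was not posted. A minimal sketch of how cProfile could be applied to the conversion from the original question might look like the following; the helper names format_if_datetime and convert_many and the values list are hypothetical, chosen only for this illustration.&lt;/P&gt;

```python
import cProfile
import io
import pstats
from datetime import datetime


def format_if_datetime(value):
    """Format datetime values as strings; pass every other value through unchanged."""
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d %H:%M:%S")
    return value


def convert_many(values):
    """Apply the conversion to every value in a list."""
    return [format_if_datetime(value) for value in values]


# Hypothetical data standing in for values read from the DataFrame.
values = [datetime(2024, 11, 15, 5, 11, 3), "already a string", 42] * 1000

# Profile only the conversion step.
profiler = cProfile.Profile()
profiler.enable()
converted = convert_many(values)
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

&lt;P&gt;The report lists call counts and per-function times, which shows where time goes inside a block instead of giving only one elapsed figure for the whole block.&lt;/P&gt;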
      <pubDate>Tue, 19 Nov 2024 08:28:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-correct-way-to-measure-the-performance-of-a/m-p/99298#M39953</guid>
      <dc:creator>guangyi</dc:creator>
      <dc:date>2024-11-19T08:28:25Z</dc:date>
    </item>
  </channel>
</rss>

