Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What is the correct way to measure the performance of a Databricks notebook?

guangyi
Contributor III

Here is my code for converting one column of a data frame from a datetime to a formatted string:

import time
from datetime import datetime

col_value = df.select(df.columns[0]).first()[0]

start_time = time.time()
col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S") \
           if isinstance(col_value, datetime) \
           else col_value
end_time = time.time()
elapsed_time = end_time - start_time
print(elapsed_time)

The elapsed_time value is 0.00011396408081054688, which makes sense, since the conversion should cost very little.

However, after I put this code inside a Python loop, things turned strange. Here is the code:

for col in df.columns:
  col_value = df.select(col).first()[0]

  start_time = time.time()
  col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S") \
              if isinstance(col_value, datetime) \
              else col_value
  end_time = time.time()
  elapsed_time = end_time - start_time
  print(elapsed_time)

After running this code, I found that elapsed_time had increased to 5 seconds!

Then I removed the time conversion logic and repeated the measurement, this time timing the whole loop as well:

loop_start_time = time.time()
for col in col_list:
  start_time = time.time()
  # Nothing was done here
  end_time = time.time()
  elapsed_time = end_time - start_time
  print(col, elapsed_time)

loop_end_time = time.time()
loop_elapsed_time = loop_end_time - loop_start_time
print(f"Loop time cost {loop_elapsed_time} seconds")

It looks like each round without any logic also costs more than 5 seconds, yet the whole loop only costs 0.0026 seconds.

Why did this happen? Did I miss something? What is the correct way to measure the cost of each Python statement and function?

1 ACCEPTED SOLUTION


Walter_C
Databricks Employee

The behavior you're observing is likely due to a combination of factors related to how Python executes code and how time is measured. Let's break down the issues and provide some recommendations for more accurate timing:

  1. Resolution of time.time():
    The resolution of time.time() is typically around 1 microsecond on most systems. For very fast operations, this might not be accurate enough.
  2. Overhead of function calls:
    Calling time.time() itself takes some time, which can be significant for very fast operations.
  3. Python interpreter overhead:
    The Python interpreter introduces some overhead, especially when executing small pieces of code repeatedly.
  4. System-level scheduling:
    The operating system may introduce delays between iterations of your loop, leading to inconsistent measurements.
  5. JIT compilation (if using PyPy):
    If you're using PyPy, just-in-time compilation can cause timing variations.


To address this, you can try:

  1. Use timeit module:
    The timeit module is designed for benchmarking small code snippets and handles many of the issues mentioned above.
  2. Use time.perf_counter():
    For more precise timing, use time.perf_counter() instead of time.time(). It provides higher resolution and is monotonic.
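A minimal sketch of those two recommendations (the strftime call below is just a stand-in for the snippet being measured, not code from the original post):

```python
import time
import timeit
from datetime import datetime

value = datetime(2024, 1, 1, 12, 30, 0)

# 1. timeit: runs the snippet many times and reports the total,
#    which averages out the per-call measurement overhead.
runs = 100_000
total = timeit.timeit(
    lambda: value.strftime("%Y-%m-%d %H:%M:%S"),
    number=runs,
)
print(f"timeit: {total / runs:.3e} s per call")

# 2. time.perf_counter(): a monotonic, high-resolution clock,
#    better suited than time.time() for short intervals.
start = time.perf_counter()
formatted = value.strftime("%Y-%m-%d %H:%M:%S")
elapsed = time.perf_counter() - start
print(f"perf_counter: {elapsed:.3e} s")
```

For one-off cells in a notebook, perf_counter is the drop-in replacement; timeit is the better choice when the operation is fast enough that a single measurement is dominated by noise.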


4 REPLIES

Walter_C
Databricks Employee

(This reply is the same as the accepted solution above.)

Thank you for your advice. I found some learning material about how to do performance profiling in Python. cProfile gave me the correct results.
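A minimal cProfile sketch along those lines (the format_value helper and the loop are illustrative only, not code from the original post):

```python
import cProfile
import io
import pstats
from datetime import datetime

def format_value(value):
    # Convert datetimes to strings, pass everything else through.
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d %H:%M:%S")
    return value

def run():
    for _ in range(10_000):
        format_value(datetime(2024, 1, 1, 12, 30, 0))

# Profile run() and print the top functions by cumulative time,
# so the expensive calls show up by name rather than as one lump sum.
profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Unlike a hand-rolled time.time() pair, the profiler attributes time to each function, which makes it much easier to see where a loop is actually spending its seconds.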

holly
Databricks Employee

You've asked what the correct way to measure performance is. Whilst the above is correct, it won't necessarily help you for production ETL. When jobs start running into hours, you need to start using the Spark UI and looking at cluster metrics to understand how your work is being distributed across the nodes.

Lakshay
Databricks Employee

How many columns do you have?
