Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What is the correct way to measure the performance of a Databricks notebook?

guangyi
Contributor III

Here is my code for converting one column of a data frame from a datetime to a formatted string:

import time
from datetime import datetime

col_value = df.select(df.columns[0]).first()[0]

start_time = time.time()
col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S") \
           if isinstance(col_value, datetime) \
           else col_value
end_time = time.time()
elapsed_time = end_time - start_time
print(elapsed_time)

The elapsed_time value is 0.00011396408081054688, which makes sense, since the conversion should cost very little.

However, after I put this code inside a Python loop, things turned strange. Here is the code:

for col in df.columns:
  col_value = df.select(col).first()[0]

  start_time = time.time()
  col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S") \
              if isinstance(col_value, datetime) \
              else col_value
  end_time = time.time()
  elapsed_time = end_time - start_time
  print(elapsed_time)

After running this code, I found that elapsed_time had increased to 5 seconds!

Then I removed the time conversion logic and repeated the measurement, this time timing the whole loop as well:

loop_start_time = time.time()
for col in col_list:
  start_time = time.time()
  # Nothing was done here
  end_time = time.time()
  elapsed_time = end_time - start_time
  print(col, elapsed_time)

loop_end_time = time.time()
loop_elapsed_time = loop_end_time - loop_start_time
print(f"Loop time cost {loop_elapsed_time} seconds")

It looks like each round without any logic also costs more than 5 seconds, yet the whole loop only costs 0.0026 seconds.

Why did this happen? Did I miss something? What is the correct way to measure the cost of each Python statement and function?

1 ACCEPTED SOLUTION


Walter_C
Databricks Employee

The behavior you're observing is likely due to a combination of factors related to how Python executes code and how time is measured. Let's break down the issues and provide some recommendations for more accurate timing:

  1. Resolution of time.time():
    The resolution of time.time() is typically around 1 microsecond on most systems. For very fast operations, this might not be accurate enough.
  2. Overhead of function calls:
    Calling time.time() itself takes some time, which can be significant for very fast operations.
  3. Python interpreter overhead:
    The Python interpreter introduces some overhead, especially when executing small pieces of code repeatedly.
  4. System-level scheduling:
    The operating system may introduce delays between iterations of your loop, leading to inconsistent measurements.
  5. JIT compilation (if using PyPy):
    If you're using PyPy, just-in-time compilation can cause timing variations.


To address this, you can try:

  1. Use timeit module:
    The timeit module is designed for benchmarking small code snippets and handles many of the issues mentioned above.
  2. Use time.perf_counter():
    For more precise timing, use time.perf_counter() instead of time.time(). It provides higher resolution and is monotonic.
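A minimal sketch of those two recommendations (the strftime call below is just a stand-in for the snippet being measured, not code from the original post):

```python
import time
import timeit
from datetime import datetime

value = datetime(2024, 1, 1, 12, 30, 0)

# 1. timeit: runs the snippet many times and reports the total,
#    which averages out the per-call measurement overhead.
runs = 100_000
total = timeit.timeit(
    lambda: value.strftime("%Y-%m-%d %H:%M:%S"),
    number=runs,
)
print(f"timeit: {total / runs:.3e} s per call")

# 2. time.perf_counter(): a monotonic, high-resolution clock,
#    better suited than time.time() for short intervals.
start = time.perf_counter()
formatted = value.strftime("%Y-%m-%d %H:%M:%S")
elapsed = time.perf_counter() - start
print(f"perf_counter: {elapsed:.3e} s")
```

For one-off cells in a notebook, perf_counter is the drop-in replacement; timeit is the better choice when the operation is fast enough that a single measurement is dominated by noise.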


4 REPLIES

Walter_C
Databricks Employee

(This reply is the same as the accepted solution above.)

Thank you for your advice. I found some learning material about how to do performance profiling in Python. cProfile gave me the correct results.
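A minimal cProfile sketch along those lines (the format_value helper and the loop are illustrative only, not code from the original post):

```python
import cProfile
import io
import pstats
from datetime import datetime

def format_value(value):
    # Convert datetimes to strings, pass everything else through.
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d %H:%M:%S")
    return value

def run():
    for _ in range(10_000):
        format_value(datetime(2024, 1, 1, 12, 30, 0))

# Profile run() and print the top functions by cumulative time,
# so the expensive calls show up by name rather than as one lump sum.
profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Unlike a hand-rolled time.time() pair, the profiler attributes time to each function, which makes it much easier to see where a loop is actually spending its seconds.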

holly
Databricks Employee

You've asked what the correct way to measure performance is. Whilst the above is correct, it won't necessarily help you for production ETL. When jobs start running into hours, you need to start using the Spark UI and looking at cluster metrics to understand how your work is being distributed across the nodes.

Lakshay
Databricks Employee

How many columns do you have?
