What is the correct way to measure the performance of a Databricks notebook?

guangyi — Fri, 15 Nov 2024 05:11:03 GMT

Here is my code for converting a datetime value taken from the first column of a DataFrame into a formatted string:

import time
from datetime import datetime

col_value = df.select(df.columns[0]).first()[0]
start_time = time.time()
col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S") \
    if isinstance(col_value, datetime) \
    else col_value
end_time = time.time()
elapsed_time = end_time - start_time
print(elapsed_time)

The elapsed_time value is 0.00011396408081054688, which makes sense, since this conversion should cost very little.

However, after I put this code inside a Python loop, things turned strange. Here is the code:

for col in df.columns:
    col_value = df.select(col).first()[0]
    start_time = time.time()
    col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S") \
        if isinstance(col_value, datetime) \
        else col_value
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(elapsed_time)

After running this code, I found that elapsed_time had increased to 5 seconds!

Then I removed the time conversion logic and measured again, this time also timing the whole loop:

loop_start_time = time.time()
for col in col_list:
    start_time = time.time()
    # Nothing was done here
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(col, elapsed_time)
loop_end_time = time.time()
loop_elapsed_time = loop_end_time - loop_start_time
print(f"Loop time cost {loop_elapsed_time} seconds")

It looks like each iteration, even with no logic in it, still costs more than 5 seconds, yet the whole loop only costs 0.0026 seconds.

Why did this happen? Did I miss something? What is the correct way to measure the cost of each Python statement and function?

Walter_C — Fri, 15 Nov 2024 16:09:21 GMT

The behavior you're observing is likely due to a combination of factors related to how Python executes code and how time is measured. Let's break down the issues and provide some recommendations for more accurate timing:

Resolution of time.time(): The resolution of time.time() depends on the platform. It is often around a microsecond, but it can be considerably coarser on some systems, and the clock is not monotonic. For very fast operations, this might not be accurate enough.
Overhead of function calls: Calling time.time() itself takes some time, which can be significant when the operation being measured is very fast.
Python interpreter overhead: The Python interpreter introduces some overhead, especially when executing small pieces of code repeatedly.
System-level scheduling: The operating system may introduce delays between iterations of your loop, leading to inconsistent measurements.
JIT compilation (if using PyPy): If you're using PyPy, just-in-time compilation can cause timing variations. This does not apply to the standard CPython interpreter.
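
The clock properties mentioned above can be checked rather than assumed. The following is a minimal sketch using the standard library's time.get_clock_info(); the helper name describe_clock is hypothetical, and the reported values depend on the platform and Python version, so no output is shown here.

```python
import time


def describe_clock(name):
    """Return the resolution and monotonic flag that Python reports for a clock."""
    info = time.get_clock_info(name)
    return {
        "clock": name,
        "resolution_s": info.resolution,  # resolution in seconds, as reported by Python
        "monotonic": info.monotonic,      # True if the clock can never go backwards
    }


# Compare the wall clock behind time.time() with the one behind time.perf_counter().
for clock_name in ("time", "perf_counter"):
    print(describe_clock(clock_name))
```

If the "time" clock reports a coarse resolution or monotonic=False on your cluster, that is one more reason to avoid it for timing very short operations.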

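To get more reliable measurements, you can try:

Use the timeit module: The timeit module is designed for benchmarking small code snippets. It runs the snippet many times and reports the total, which averages out much of the noise mentioned above.
Use time.perf_counter(): For more precise timing, use time.perf_counter() instead of time.time(). It uses the highest-resolution clock available and is monotonic, so it is not affected by system clock adjustments.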
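
Both suggestions can be sketched as follows. This is only an illustration: to_text is a hypothetical helper that mirrors the conversion in the question, and sample is a made-up stand-in for a value read from the DataFrame. It may also be worth checking whether the surprising per-iteration values were printed in scientific notation (for example 5.2e-06, which is about five microseconds rather than five seconds). That is a guess, since the thread does not show the raw output, but Python does switch to that notation for floats smaller than 0.0001, so the sketch prints durations in fixed-point form.

```python
import time
import timeit
from datetime import datetime


def to_text(value):
    """Format datetimes as strings; return any other value unchanged."""
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d %H:%M:%S")
    return value


# Hypothetical stand-in for a value read from the first row of a column.
sample = datetime(2024, 11, 15, 5, 11, 3)

# Single measurement with a monotonic, high-resolution clock.
start = time.perf_counter()
result = to_text(sample)
elapsed = time.perf_counter() - start
# Fixed-point formatting avoids misreading values such as 5e-06 as "5 seconds".
print(f"single call: {elapsed:.9f} s")

# Averaged measurement: timeit runs the call `runs` times and returns the total.
runs = 10_000
total = timeit.timeit(lambda: to_text(sample), number=runs)
per_call = total / runs
print(f"average over {runs} calls: {per_call:.9f} s")
```

The single measurement is noisy by nature; the averaged figure from timeit is usually the more trustworthy one for operations this small.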

holly — Mon, 18 Nov 2024 11:16:53 GMT

You've asked what the correct way to measure performance is. Whilst the above is correct, it won't necessarily help you for production ETL. When jobs start running into hours, you need to start using the Spark UI and looking at cluster metrics to understand how your work is being distributed across the nodes.

Lakshay — Mon, 18 Nov 2024 16:44:50 GMT

How many columns do you have?

guangyi — Tue, 19 Nov 2024 08:28:25 GMT

Thank you for your advice. I found some learning material about how to do performance profiling in Python. cProfile gave me correct results.
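
For readers who land here later, a minimal sketch of what profiling with cProfile can look like. This is not the exact code used above: format_values and the sample data are hypothetical, and the profile only covers Python code running on the driver, not work that Spark distributes to executors.

```python
import cProfile
import io
import pstats
from datetime import datetime


def format_values(values):
    """Format datetimes as strings and leave every other value unchanged."""
    formatted = []
    for value in values:
        if isinstance(value, datetime):
            value = value.strftime("%Y-%m-%d %H:%M:%S")
        formatted.append(value)
    return formatted


# Hypothetical sample data standing in for values collected from a DataFrame.
values = [datetime(2024, 11, 15, 5, 11, 3), "text", 42] * 1000

# Profile only the call of interest.
profiler = cProfile.Profile()
profiler.enable()
formatted = format_values(values)
profiler.disable()

# Render the five most expensive entries, sorted by cumulative time, into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The report lists call counts and per-function times, which answers the "cost of each function" part of the question more directly than wrapping statements in manual timers.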