Here is my code for converting one column value of a DataFrame to a formatted time string:
import time
from datetime import datetime

col_value = df.select(df.columns[0]).first()[0]

start_time = time.time()
col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S") \
    if isinstance(col_value, datetime) \
    else col_value
end_time = time.time()
elapsed_time = end_time - start_time
print(elapsed_time)
The elapsed_time value is 0.00011396408081054688, which makes sense: this conversion should cost very little.
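For reference, this is roughly how I was planning to time that single statement more reliably with timeit (same df as above; the repeat and number values are arbitrary), though I am not sure this is the recommended approach either:

import timeit
from datetime import datetime

col_value = df.select(df.columns[0]).first()[0]

# Run the statement many times and take the best batch, to smooth out the
# noise of a single time.time() sample
best = min(timeit.repeat(
    lambda: datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S")
            if isinstance(col_value, datetime) else col_value,
    repeat=5,
    number=1000,
))
print(best / 1000)  # average seconds per execution in the best batch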
However, after I put this code inside a Python loop, things turned strange. Here is the code:
for col in df.columns:
    col_value = df.select(col).first()[0]
    start_time = time.time()
    col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S") \
        if isinstance(col_value, datetime) \
        else col_value
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(elapsed_time)
After running this code, I found that elapsed_time had increased to 5 seconds!
Then I removed the time-conversion logic and redid the measurement, this time also timing the whole loop:
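To narrow down where the time actually goes inside the loop, I am considering timing the Spark fetch and the string conversion separately, roughly like this (same df as above; I switched to time.perf_counter on the assumption that it is better suited for measuring short intervals):

import time
from datetime import datetime

for col in df.columns:
    t0 = time.perf_counter()
    col_value = df.select(col).first()[0]          # Spark action
    t1 = time.perf_counter()
    col_value = (datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S")
                 if isinstance(col_value, datetime)
                 else col_value)                   # pure-Python formatting
    t2 = time.perf_counter()
    print(col, "fetch:", t1 - t0, "convert:", t2 - t1)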
loop_start_time = time.time()
for col in col_list:
    start_time = time.time()
    # Nothing was done here
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(col, elapsed_time)
loop_end_time = time.time()
loop_elapsed_time = loop_end_time - loop_start_time
print(f"Loop time cost {loop_elapsed_time} seconds")
It looks like each round still costs more than 5 seconds even with no logic inside, yet the whole loop only costs 0.0026 seconds.
Why did this happen? Did I miss something? What is the correct way to measure the cost of each Python statement and function?
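In case it matters, the kind of per-function measurement I have in mind is something like cProfile (the convert_all wrapper below is just for illustration):

import cProfile
from datetime import datetime

def convert_all(df):
    # Same per-column logic as above, wrapped in a function so it can be profiled
    for col in df.columns:
        col_value = df.select(col).first()[0]
        if isinstance(col_value, datetime):
            col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S")

cProfile.run("convert_all(df)", sort="cumtime")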